

Expanded Edition 


Perceptrons 


Marvin L. Minsky 
Seymour A. Papert 



Marvin Minsky and Seymour Papert 


Perceptrons 

An Introduction to Computational Geometry 


Expanded Edition 




Third printing, 1988 

Copyright 1969 Massachusetts Institute of Technology. 
Handwritten alterations were made by the authors for the 
second printing (1972). Preface and epilogue copyright 
1988 Marvin Minsky and Seymour Papert. Expanded edi- 
tion, 1988. 

All rights reserved. No part of this book may be repro- 
duced in any form by any electronic or mechanical means 
(including photocopying, recording, or information stor- 
age and retrieval) without permission in writing from the 
publisher. 

This book was set in Photon Times Roman by The Science 
Press, Inc., and printed and bound by Halliday Lithograph 
in the United States of America. 

Library of Congress Cataloging-in-Publication Data 

Minsky, Marvin Lee, 1927— 

Perceptrons : An introduction to computational 
geometry. 

Bibliography: p. 

Includes index. 

1. Perceptrons. 2. Geometry — Data processing. 

3. Parallel processing (Electronic computers) 

4. Machine learning. I. Papert, Seymour. II. Title. 

Q327.M55 1988 006.3 87-30990 

ISBN 0-262-63111-3 (pbk.) 



CONTENTS 


Prologue: A View from 1988 vii 

0 Introduction 1 

I

Algebraic Theory of Linear Parallel Predicates 21 

1 Theory of Linear Boolean Inequalities 25 

2 Group Invariance of Boolean Inequalities 39 

3 Parity and One-in-a-box Predicates 56 

4 The “And/Or” Theorem 62 

II 

Geometric Theory of Linear Inequalities 69 

5 ψ_CONNECTED: A Geometric Property with Unbounded Order 73

6 Geometric Patterns of Small Order: Spectra and Context 96 

7 Stratification and Normalization 114 

8 The Diameter-limited Perceptron 129

9 Geometric Predicates and Serial Algorithms 136 

III

Learning Theory 149 

10 Magnitude of the Coefficients 151 

11 Learning 161 

12 Linear Separation and Learning 188 

13 Perceptrons and Pattern Recognition 227

Epilogue: The New Connectionism 247 

Bibliographic Notes 281 

Index 295 




Prologue: A View from 1988 


This book is about perceptrons — the simplest learning machines. 
However, our deeper purpose is to gain more general insights into 
the interconnected subjects of parallel computation, pattern recog- 
nition, knowledge representation, and learning. It is only because 
one cannot think productively about such matters without studying 
specific examples that we focus on theories of perceptrons. 

In preparing this edition we were tempted to “bring those theories 
up to date.” But when we found that little of significance had 
changed since 1969, when the book was first published, we con- 
cluded that it would be more useful to keep the original text (with 
its corrections of 1972) and add an epilogue, so that the book could 
still be read in its original form. One reason why progress has been 
so slow in this field is that researchers unfamiliar with its history 
have continued to make many of the same mistakes that others 
have made before them. Some readers may be shocked to hear it 
said that little of significance has happened in this field. Have not 
perceptron-like networks — under the new name connectionism — 
become a major subject of discussion at gatherings of psycholo- 
gists and computer scientists? Has not there been a “connectionist 
revolution?” Certainly yes, in that there is a great deal of interest 
and discussion. Possibly yes, in the sense that discoveries have 
been made that may, in time, turn out to be of fundamental impor- 
tance. But certainly no, in that there has been little clear-cut 
change in the conceptual basis of the field. The issues that give rise
to excitement today seem much the same as those that were re- 
sponsible for previous rounds of excitement. The issues that were 
then obscure remain obscure today because no one yet knows how 
to tell which of the present discoveries are fundamental and which 
are superficial. Our position remains what it was when we wrote 
the book: We believe this realm of work to be immensely important 
and rich, but we expect its growth to require a degree of critical 
analysis that its more romantic advocates have always been reluc- 
tant to pursue — perhaps because the spirit of connectionism seems 
itself to go somewhat against the grain of analytic rigor. 

In the next few pages we will try to portray recent events in the 
field of parallel-network learning machines as taking place within 
the historical context of a war between antagonistic tendencies 
called symbolist and connectionist. Many of the participants in this 
history see themselves as divided on the question of strategies for
thinking — a division that now seems to pervade our culture, engag- 
ing not only those interested in building models of mental functions 
but also writers, educators, therapists, and philosophers. Too 
many people too often speak as though the strategies of thought fall 
naturally into two groups whose attributes seem diametrically op- 
posed in character: 


symbolic 

logical 

serial 

discrete 

localized 

hierarchical 

left-brained 


connectionist 

analogical 

parallel 

continuous 

distributed 

heterarchical 

right-brained 


This broad division makes no sense to us, because these attributes 
are largely independent of one another; for example, the very same 
system could combine symbolic, analogical, serial, continuous, 
and localized aspects. Nor do many of those pairs imply clear 
opposites; at best they merely indicate some possible extremes 
among some wider range of possibilities. And although many good 
theories begin by making distinctions, we feel that in subjects as 
broad as these there is less to be gained from sharpening bound- 
aries than from seeking useful intermediates. 


The 1940s: Neural Networks 

The 1940s saw the emergence of the simple yet powerful concept 
that the natural components of mind-like machines were simple 
abstractions based on the behavior of biological nerve cells, and 
that such machines could be built by interconnecting such ele- 
ments. In their 1943 manifesto “A Logical Calculus of the Ideas 
Immanent in Nervous Activity,” Warren McCulloch and Walter 
Pitts presented the first sophisticated discussion of “neuro-logical 
networks,” in which they combined new ideas about finite-state 
machines, linear threshold decision elements, and logical represen- 
tations of various forms of behavior and memory. In 1947 they 
published a second monumental essay, “How We Know Universals,”
which described network architectures capable, in principle,
of recognizing spatial patterns in a manner invariant under
groups of geometric transformations. 



From such ideas emerged the intellectual movement called cyber- 
netics, which attempted to combine many concepts from biology, 
psychology, engineering, and mathematics. The cybernetics era 
produced a flood of architectural schemes for making neural net- 
works recognize, track, memorize, and perform many other useful 
functions. The decade ended with the publication of Donald 
Hebb’s book The Organization of Behavior, the first attempt to 
base a large-scale theory of psychology on conjectures about 
neural networks. The central idea of Hebb’s book was that such 
networks might learn by constructing internal representations of 
concepts in the form of what Hebb called “cell-assemblies” — 
subfamilies of neurons that would learn to support one another’s 
activities. There had been earlier attempts to explain psychology in 
terms of “connections” or “associations,” but (perhaps because 
those connections were merely between symbols or ideas rather 
than between mechanisms) those theories seemed then too insub- 
stantial to be taken seriously by theorists seeking models for men- 
tal mechanisms. Even after Hebb’s proposal, it was years before 
research in artificial intelligence found suitably convincing ways to 
make symbolic concepts seem concrete. 

The 1950s: Learning in Neural Networks 

The cybernetics era opened up the prospect of making mind-like 
machines. The earliest workers in that field sought specific archi- 
tectures that could perform specific functions. However, in view of 
the fact that animals can learn to do many things they aren’t built to 
do, the goal soon changed to making machines that could learn. 
Now, the concept of learning is ill defined, because there is no 
clear-cut boundary between the simplest forms of memory and 
complex procedures for making predictions and generalizations 
about things never seen. Most of the early experiments were based 
on the idea of “reinforcing” actions that had been successful in the 
past — a concept already popular in behavioristic psychology. In 
order for reinforcement to be applied to a system, the system must 
be capable of generating a sufficient variety of actions from which 
to choose and the system needs some criterion of relative success. 
These are also the prerequisites for the “hill-climbing” machines 
that we discuss in section 11.6 and in the epilogue. 

Perhaps the first reinforcement-based network learning system was 
a machine built by Minsky in 1951. It consisted of forty electronic
units interconnected by a network of links, each of which had an 
adjustable probability of receiving activation signals and then 
transmitting them to other units. It learned by means of a reinforce- 
ment process in which each positive or negative judgment about 
the machine’s behavior was translated into a small change (of cor- 
responding magnitude and sign) in the probabilities associated with 
whichever connections had recently transmitted signals. The 1950s 
saw many other systems that exploited simple forms of learning, 
and this led to a professional specialty called adaptive control. 

Today, people often speak of neural networks as offering a promise 
of machines that do not need to be programmed. But speaking of 
those old machines in such a way stands history on its head, since 
the concept of programming had barely appeared at that time. 
When modern serial computers finally arrived, it became a great 
deal easier to experiment with learning schemes and “self- 
organizing systems.” However, the availability of computers also 
opened up other avenues of research into learning. Perhaps the 
most notable example of this was Arthur Samuel’s research on 
programming computers to learn to play checkers. (See Biblio- 
graphic Notes.) Using a success-based reward system, Samuel’s 
1959 and 1967 programs attained masterly levels of performance. 
In developing those procedures, Samuel encountered and de- 
scribed two fundamental questions: 

Credit assignment. Given some existing ingredients, how does one 
decide how much to credit each of them for each of the machine’s 
accomplishments? In Samuel’s machine, the weights are assigned 
by correlation with success. 

Inventing novel predicates. If the existing ingredients are inade- 
quate, how does one invent new ones? Samuel’s machine tests 
products of some preexisting terms. 

Most researchers tried to bypass these questions, either by ignor- 
ing them or by using brute force or by trying to discover powerful 
and generally applicable methods. Few researchers tried to use 
them as guides to thoughtful research. We do not believe that any 
completely general solution to them can exist, and we argue in our 
epilogue that awareness of these issues should lead to a model of 
mind that can accumulate a multiplicity of specialized methods. 



By the end of the 1950s, the field of neural-network research had 
become virtually dormant. In part this was because there had not 
been many important discoveries for a long time. But it was also 
partly because important advances in artificial intelligence had 
been made through the use of new kinds of models based on serial 
processing of symbolic expressions. New landmarks appeared in 
the form of working computer programs that solved respectably 
difficult problems. In the wake of these accomplishments, theories 
based on connections among symbols suddenly seemed more satis- 
factory. And although we and some others maintained allegiances 
to both approaches, intellectual battle lines began to form along 
such conceptual fronts as parallel versus serial processing, learn- 
ing versus programming, and emergence versus analytic descrip- 
tion. 

The 1960s: Connectionists and Symbolists 

Interest in connectionist networks revived dramatically in 1962 
with the publication of Frank Rosenblatt’s book Principles of 
Neurodynamics, in which he defined the machines he named perceptrons
and proved many theorems about them. (See Bibliographic
Notes.) The basic idea was so simply and clearly defined that it was 
feasible to prove an amazing theorem (theorem 11.1 below) which 
stated that a perceptron would learn to do anything that it was 
possible to program it to do. And the connectionists of the 1960s 
were indeed able to make perceptrons learn to do certain things — 
but not other things. Usually, when a failure occurred, neither 
prolonging the training experiments nor building larger machines 
helped. All perceptrons would fail to learn to do those things, and 
once again the work in this field stalled. 

Arthur Samuel’s two questions can help us see why perceptrons 
worked as well as they did. First, Rosenblatt’s credit-assignment 
method turned out to be as effective as any such method could be. 
When the answer is obtained, in effect, by adding up the contribu- 
tions of many processes that have no significant interactions among 
themselves, then the best one can do is reward them in proportion 
to how much each of them contributed. (Actually, with percep- 
trons, one never rewards success; one only punishes failure.) And 
Rosenblatt offered the simplest possible approach to the problem 
of inventing new parts: You don’t have to invent new parts if
enough parts are provided from the start. Once it became clear that 
these tactics would work in certain circumstances but not in 
others, most workers searched for methods that worked in general. 
However, in our book we turned in a different direction. Instead of 
trying to find a method that would work in every possible situation, 
we sought to find ways to understand why the particular method 
used in the perceptron would succeed in some situations but not in 
others. It turned out that perceptrons could usually solve the types 
of problems that we characterize (in section 0.8) as being of low 
“order.” With those problems, one can indeed sometimes get by 
with making ingredients at random and then selecting the ones that 
work. However, problems of higher “orders” require too many 
such ingredients for this to be feasible. 
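
The error-correction idea described above, in which weights change only after a wrong answer, can be sketched in a few lines. This is a minimal illustration and not the procedure as stated in the book: the function name `train_perceptron`, the `max_epochs` cutoff, the zero starting weights, and the choice of the two-input AND and exclusive-or (parity) predicates as test cases are all assumptions made for the example. The unit reads only the two raw inputs, so it has no ingredient that looks at both points at once.

```python
from itertools import product


def train_perceptron(samples, n_inputs, max_epochs=100):
    """Error-correction training of one linear threshold unit.

    samples: list of (inputs, target), inputs a tuple of 0/1, target 0 or 1.
    Returns (weights, bias, converged); converged is True once a full pass
    over the samples produces no mistakes.
    """
    w = [0] * n_inputs
    b = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in samples:
            fired = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = target - fired   # 0 on a correct answer: success is never rewarded
            if err != 0:           # only a failure changes the weights
                w = [wi + err * xi for wi, xi in zip(w, x)]
                b += err
                mistakes += 1
        if mistakes == 0:
            return w, b, True
    return w, b, False


inputs = list(product((0, 1), repeat=2))
and_samples = [(x, x[0] & x[1]) for x in inputs]
xor_samples = [(x, x[0] ^ x[1]) for x in inputs]

and_w, and_b, and_ok = train_perceptron(and_samples, 2)
xor_w, xor_b, xor_ok = train_perceptron(xor_samples, 2)
print("AND learned:", and_ok)
print("XOR learned:", xor_ok)
```

Under these assumptions the AND predicate is learned, while the exclusive-or case never completes a mistake-free pass however large `max_epochs` is made, because no setting of two weights and a threshold represents it. Prolonging the training does not help; the missing piece is a representation, not more learning.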

This style of analysis was the first to show that there are fundamen- 
tal limitations on the kinds of patterns that perceptrons can ever 
learn to recognize. How did the scientists involved with such mat- 
ters react to this? One popular version is that the publication of our 
book so discouraged research on learning in network machines that 
a promising line of research was interrupted. Our version is that 
progress had already come to a virtual halt because of the lack of 
adequate basic theories, and the lessons in this book provided the 
field with new momentum — albeit, paradoxically, by redirecting its 
immediate concerns. To understand the situation, one must recall 
that by the mid 1960s there had been a great many experiments with 
perceptrons, but no one had been able to explain why they were 
able to learn to recognize certain kinds of patterns and not others. 
Was this in the nature of the learning procedures? Did it depend on 
the sequences in which the patterns were presented? Were the 
machines simply too small in capacity? 

What we discovered was that the traditional analysis of learning 
machines — and of perceptrons in particular — had looked in the 
wrong direction. Most theorists had tried to focus only on the 
mathematical structure of what was common to all learning, and 
the theories to which this had led were too general and too weak to 
explain which patterns perceptrons could learn to recognize. As 
our analysis in chapter 2 shows, this actually had nothing to do with
learning at all; it had to do with the relationships between the 
perceptron’s architecture and the characters of the problems that 
were being presented to it. The trouble appeared when perceptrons
had no way to represent the knowledge required for solving certain 
problems. The moral was that one simply cannot learn enough by 
studying learning by itself; one also has to understand the nature of 
what one wants to learn. This can be expressed as a principle that 
applies not only to perceptrons but to every sort of learning ma- 
chine: No machine can learn to recognize X unless it possesses, at 
least potentially, some scheme for representing X. 

The 1970s: Representation of Knowledge 

Why have so few discoveries about network machines been made 
since the work of Rosenblatt? It has sometimes been suggested that 
the “pessimism” of our book was responsible for the fact that 
connectionism was in a relative eclipse until recent research broke 
through the limitations that we had purported to establish. Indeed, 
this book has been described as having been intended to demon- 
strate that perceptrons (and all other network machines) are too 
limited to deserve further attention. Certainly many of the best 
researchers turned away from network machines for quite some 
time, but present-day connectionists who regard that as regrettable 
have failed to understand the place at which they stand in history. 
As we said earlier, it seems to us that the effect of Perceptrons was 
not simply to interrupt a healthy line of research. That redirection 
of concern was no arbitrary diversion; it was a necessary interlude. 
To make further progress, connectionists would have to take time 
off and develop adequate ideas about the representation of knowl- 
edge. In the epilogue we shall explain why that was a prerequisite 
for understanding more complex types of network machines. 

In any case, the 1970s became the golden age of a new field of 
research into the representation of knowledge. And it was not only 
connectionist learning that was placed on hold; it also happened 
to research on learning in the field of artificial intelligence. For 
example, although Patrick Winston’s 1970 doctoral thesis (see 
“Learning Structural Definitions from Examples,” in The Psychol- 
ogy of Computer Vision, ed. P. H. Winston [McGraw-Hill, 1975]) 
was clearly a major advance, the next decade of AI research saw 
surprisingly little attention to that subject. 

In several other related fields, many researchers set aside their 
interest in the study of learning in favor of examining the representation
of knowledge in many different contexts and forms. The
result was the very rapid development of many new and powerful 
ideas — among them frames, conceptual dependency, production 
systems, word-expert parsers, relational databases, K-lines, 
scripts, nonmonotonic logic, semantic networks, analogy genera- 
tors, cooperative processes, and planning procedures. These ideas 
about the analysis of knowledge and its embodiments, in turn, had 
strong effects not only in the heart of artificial intelligence but also 
in many areas of psychology, brain science, and applied expert 
systems. Consequently, although not all of them recognize this, a 
good deal of what young researchers do today is based on what was 
learned about the representation of knowledge since Perceptrons 
first appeared. As was asserted above, their not knowing that his- 
tory often leads them to repeat mistakes of the past. For example, 
many contemporary experimenters assume that, because the per- 
ceptron networks discussed in this book are not exactly the same 
as those in use today, these theorems no longer apply. Yet, as we 
will show in our epilogue, most of the lessons of the theorems still 
apply. 

The 1980s: The Revival of Learning Machines 

What could account for the recent resurgence of interest in net- 
work machines? What turned the tide in the battle between the 
connectionists and the symbolists? Was it that symbolic AI had run 
out of steam? Was it the important new ideas in connectionism? 
Was it the prospect of new, massively parallel hardware? Or did 
the new interest reflect a cultural turn toward holism? 

Whatever the answer, a more important question is: Are there 
inherent incompatibilities between those connectionist and sym- 
bolist views? The answer to that depends on the extent to which 
one regards each separate connectionist scheme as a self-standing 
system. If one were to ask whether any particular, homogeneous 
network could serve as a model for a brain, the answer (we claim) 
would be, clearly, No. But if we consider each such network as a 
possible model for a part of a brain, then those two overviews are 
complementary. 

This is why we see no reason to choose sides. We expect a great 
many new ideas to emerge from the study of symbol-based theories 
and experiments. And we expect the future of network-based 
learning machines to be rich beyond imagining. As we say in section
0.3, the solemn experts who complained most about the “exaggerated
claims” of the cybernetics enthusiasts were, on balance,
in the wrong. It is just as clear to us today as it was 20 years ago that 
the marvelous abilities of the human brain must emerge from the 
parallel activity of vast assemblies of interconnected nerve cells. 
But, as we explain in our epilogue, the marvelous powers of the 
brain emerge not from any single, uniformly structured connec- 
tionist network but from highly evolved arrangements of smaller, 
specialized networks which are interconnected in very specific 
ways. 

The movement of research interest between the poles of connec- 
tionist learning and symbolic reasoning may provide a fascinating 
subject for the sociology of science, but workers in those fields 
must understand that these poles are artificial simplifications. It 
can be most revealing to study neural nets in their purest forms, or 
to do the same with elegant theories about formal reasoning. Such 
isolated studies often help in the disentangling of different types of 
mechanisms, insights, and principles. But it never makes any sense 
to choose either of those two views as one’s only model of the 
mind. Both are partial and manifestly useful views of a reality of 
which science is still far from a comprehensive understanding. 





Introduction 

0


0.0 Readers 

In writing this we had in mind three kinds of readers. First, there 
are many new results that will interest specialists concerned with 
“pattern recognition,” “learning machines,” and “threshold 
logic.” Second, some people will enjoy reading it as an essay in 
abstract mathematics; it may appeal especially to those who would 
like to see geometry return to topology and algebra. We ourselves 
share both these interests. But we would not have carried the work 
as far as we have, nor presented it in the way we shall, if it were 
not for a different, less clearly defined, set of interests. 

The goal of this study is to reach a deeper understanding of some 
concepts we believe are crucial to the general theory of computa- 
tion. We will study in great detail a class of computations that 
make decisions by weighing evidence. Certainly, this problem is 
of great interest in itself, but our real hope is that understanding 
of its mathematical structure will prepare us eventually to go 
further into the almost unexplored theory of parallel computers. 

The people we want most to speak to are interested in that general 
theory of computation. We hope this includes psychologists and 
biologists who would like to know how the brain computes 
thoughts and how the genetic program computes organisms. We 
do not pretend to give answers to such questions — nor even to 
propose that the simple structures we shall use should be taken as 
“models” for such processes. Our aim — we are not sure whether 
it is more modest or more ambitious — is to illustrate how such a 
theory might begin, and what strategies of research could lead to 
it. 

It is for this third class of readers that we have written this intro- 
duction. It may help those who do not have an immediate involve- 
ment with it to see that the theory of pattern recognition might be 
worth studying for other reasons. At the same time we will set out 
a simplified version of the theory to help readers who have not 
had the mathematical training that would make the later chapters 
easy to read. The rest of the book is self-contained and anyone 
who hates introductions may go directly to Chapter 1. 

0.1 Real, Abstract, and Mythological Computers 

We know shamefully little about our computers and their computations.
This seems paradoxical because, physically and logically,
computers are so lucidly transparent in their principles of opera- 
tion. Yet even a school boy can ask questions about them that 
today’s “computer science” cannot answer. We know very little, 
for instance, about how much computation a job should require. 

As an example, consider one of the most frequently performed 
computations: solving a set of linear equations. This is important 
in virtually every kind of scientific work. There are a variety of 
standard programs for it, which are composed of additions, mul- 
tiplications, and divisions. One would suppose that such a simple 
and important subject, long studied by mathematicians, would by 
now be thoroughly understood. But we ask, How many arithme- 
tic steps are absolutely required? How does this depend on the 
amount of computer memory? How much time can we save if we 
have two (or n) identical computers? Every computer scientist 
“knows” that this computation requires something of the order of 
n³ multiplications for n equations, but even if this be true no one
knows — at this writing — how to begin to prove it. 
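
To make the "order of n³" figure concrete, the sketch below solves a system by ordinary Gaussian elimination with back-substitution and tallies every multiplication and division it performs. It is only an illustration of how one standard program behaves, not a statement about how many steps are required. The names `solve_counting` and `sample_system` are hypothetical, exact `Fraction` arithmetic is used to avoid rounding, and the matrix is assumed to be nonsingular.

```python
from fractions import Fraction


def solve_counting(a, b):
    """Solve a x = b by Gaussian elimination and back-substitution.

    Returns (solution, count), where count tallies multiplications and
    divisions. Assumes a is square and nonsingular; rows are swapped when
    a pivot happens to be zero.
    """
    n = len(a)
    # Augmented matrix [a | b] in exact rational arithmetic.
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    count = 0
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            count += 1                      # one division per eliminated row
            for c in range(col + 1, n + 1):
                m[r][c] -= factor * m[col][c]
                count += 1                  # one multiplication per updated entry
            m[r][col] = Fraction(0)
    x = [Fraction(0)] * n
    for r in range(n - 1, -1, -1):
        acc = m[r][n]
        for c in range(r + 1, n):
            acc -= m[r][c] * x[c]
            count += 1
        x[r] = acc / m[r][r]
        count += 1
    return x, count


def sample_system(n):
    """A diagonally dominant test system whose exact solution is all ones."""
    a = [[n + 1 if i == j else 1 for j in range(n)] for i in range(n)]
    b = [sum(row) for row in a]
    return a, b


for n in (2, 4, 8, 16):
    a, b = sample_system(n)
    x, count = solve_counting(a, b)
    print(n, count, n ** 3)
```

For this particular routine the tally works out to (n³ + 3n² − n)/3, which grows like n³/3. The count describes what this program does; it says nothing about whether some other program could manage with fewer steps, which is exactly the question the text raises.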

Neither the outsider nor the computation specialist seems to 
recognize how primitive and how empirical is our present state of 
understanding of such matters. We do not know how much the 
speed of computations can be increased, in general, by using 
“parallel” as opposed to “serial” — or “analog” as opposed to 
“digital” — machines. We have no theory of the situations in 
which “associative” memories will justify their higher cost as 
compared to “addressed” memories. There is a great deal of folk- 
lore about this sort of contrast, but much of this folklore is mere 
superstition; in the cases we have studied carefully, the common 
beliefs turn out to be not merely “unproved”; they are often 
drastically wrong. 

The immaturity shown by our inability to answer questions of this 
kind is exhibited even in the language used to formulate the questions.
Word pairs such as “parallel” vs. “serial,” “local” vs.
“global,” and “digital” vs. “analog” are used as if they referred 
to well-defined technical concepts. Even when this is true, the 
technical meaning varies from user to user and context to con- 
text. But usually they are treated so loosely that the species of 
computing machine defined by them belongs to mythology rather 
than science. 



Now we do not mean to suggest that these are mere pseudo- 
problems that arise from sloppy use of language. This is not a 
book of “therapeutic semantics”! For there is much content in 
these intuitive ideas and distinctions. The problem is how to 
capture it in a clear, sharp theory. 

0.2 Mathematical Strategy 

We are not convinced that the time is ripe to attempt a very 
general theory broad enough to encompass the concepts we have 
mentioned and others like them. Good theories rarely develop 
outside the context of a background of well-understood real prob- 
lems and special cases. Without such a foundation, one gets either 
the vacuous generality of a theory with more definitions than 
theorems — or a mathematically elegant theory with no applica- 
tion to reality. 

Accordingly, our best course would seem to be to strive for a very 
thorough understanding of well-chosen particular situations in 
which these concepts are involved. 

We have chosen in fact to explore the properties of the simplest 
machines we could find that have a clear claim to be “parallel” — 
for they have no loops or feedback paths — yet can perform 
computations that are nontrivial, both in practical and in mathe- 
matical respects. 

Before we proceed into details, we would like to reassure non- 
mathematicians who might be frightened by what they have 
glimpsed in the pages ahead. The mathematical methods used are 
rather diverse, but they seldom require advanced knowledge. We 
explain most of that which goes beyond elementary algebra and 
geometry. Where this was not practical, we have marked as op- 
tional those sections we feel might demand from most readers 
more mathematical effort than is warranted by the topic’s role in 
the whole structure. Our theory is more like a tree with many 
branches than like a narrow high tower of blocks; in many cases 
one can skip, if trouble is encountered, to the beginning of the 
following chapter. 

The reader of most modern mathematical texts is made to work 
unduly hard by the authors’ tendency to cover over the intellectual
tracks that lead to the discovery of the theorems. We have
tried to leave visible the lines of progress. We should have liked 
to go further and leave traces of all the false tracks we followed; 
unfortunately there were too many! Nevertheless we have oc- 
casionally left an earlier proof even when we later found a 
“better” one. Our aim is not so much to prove theorems as to 
give insight into methods and to encourage research. We hope 
this will be read not as a chain of logical deductions but as a 
mathematical novel where characters appear, reappear, and 
develop. 


0.3 Cybernetics and Romanticism 

The machines we will study are abstract versions of a class of 
devices known under various names; we have agreed to use the 
name “perceptron” in recognition of the pioneer work of Frank 
Rosenblatt. Perceptrons make decisions — determine whether or 
not an event fits a certain “pattern” — by adding up evidence 
obtained from many small experiments. This clear and simple 
concept is important because most, and perhaps all, more com- 
plicated machines for making decisions share a little of this char- 
acter. Until we understand it very thoroughly, we can expect to 
have trouble with more advanced ideas. In fact, we feel that the 
critical advances in many branches of science and mathematics 
began with good formulations of the “linear” systems, and these 
machines are our candidate for beginning the study of “parallel 
machines” in general. 

Our discussion will include some rather sharp criticisms of earlier work 
in this area. Perceptrons have been widely publicized as “pattern recog- 
nition” or “learning” machines and as such have been discussed in a 
large number of books, journal articles, and voluminous “reports.” Most 
of this writing (some exceptions are mentioned in our bibliography) is 
without scientific value and we will not usually refer by name to the 
works we criticize. The sciences of computation and cybernetics began, 
and it seems quite rightly so, with a certain flourish of romanticism. They 
were laden with attractive and exciting new ideas which have already 
borne rich fruit. Heavy demands of rigor and caution could have held 
this development to a much slower pace; only the future could tell which 
directions were to be the best. We feel, in fact, that the solemn experts 
who most complained about the “exaggerated claims” of the cybernetic 
enthusiasts were, in the balance, much more in the wrong. But now the 
time has come for maturity, and this requires us to match our speculative 
enterprise with equally imaginative standards of criticism. 





0.4 Parallel Computation 

The simplest concept of parallel computation is represented by
the diagram in Figure 0.1. The figure shows how one might compute
a function $\psi(X)$ in two stages. First we compute, independently
of one another, a set of functions $\varphi_1(X), \varphi_2(X), \ldots,
\varphi_n(X)$ and then combine the results by means of a function $\Omega$ of $n$
arguments to obtain the value of $\psi$.



To make the definition meaningful (or, rather, productive) one
needs to place some restrictions on the function $\Omega$ and the set $\Phi$ of
functions $\varphi_1, \varphi_2, \ldots$. If we do not make restrictions, we do not get
a theory: any computation $\psi$ could be represented as a parallel
computation in various trivial ways, for example, by making one
of the $\varphi$'s be $\psi$ and letting $\Omega$ do nothing but transmit its result.
We will consider a variety of restrictions, but first we will give a
few concrete examples of the kinds of functions we might want $\psi$
to be.
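
As a minimal sketch of the two-stage scheme of Figure 0.1, the trivial representation just described can be written out in Python. Figures are modeled here as frozensets of points, and the names `parallel_compute`, `psi_even`, and `pass_through` are illustrative assumptions, not notation from the text.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

Point = Tuple[int, int]
Figure = FrozenSet[Point]
Predicate = Callable[[Figure], int]


def parallel_compute(phis: Sequence[Predicate],
                     omega: Callable[[List[int]], int],
                     X: Figure) -> int:
    """Stage I: evaluate every phi on X independently. Stage II: combine with omega."""
    values = [phi(X) for phi in phis]  # no phi sees another phi's result
    return omega(values)


def psi_even(X: Figure) -> int:
    """An arbitrary target predicate: 1 if X has an even number of points."""
    return 1 if len(X) % 2 == 0 else 0


def pass_through(values: List[int]) -> int:
    """An omega that does nothing but transmit the first partial result."""
    return values[0]


# The trivial representation: the single phi IS psi, so the "parallel" stage
# does all the work and the scheme says nothing about the computation.
X = frozenset({(0, 0), (1, 0), (2, 1)})
trivial_result = parallel_compute([psi_even], pass_through, X)
```

Because any predicate can be wrapped this way, the scheme only becomes informative once the partial functions or the combining function are restricted.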

0.5 Some Geometric Patterns; Predicates 

Let R be the ordinary two-dimensional Euclidean plane and let X 
be a geometric figure drawn on R. X could be a circle, or a pair of 
circles, or a black-and-white sketch of a face. In general we will 
think of a figure X as simply a subset of the points of R (that is, 
the black points). 





Let $\psi(X)$ be a function (of figures X on R) that can have but two
values. We usually think of the two values of $\psi$ as 0 and 1. But by
taking them to be false and true we can think of $\psi(X)$ as a
predicate, that is, a variable statement whose truth or falsity depends
on the choice of X. We now give a few examples of predicates
that will be of particular interest in the sequel.


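$$\psi_{\mathrm{CIRCLE}}(X) = \begin{cases} 1 & \text{if the figure } X \text{ is a circle,} \\ 0 & \text{if the figure is not a circle;} \end{cases}$$

$$\psi_{\mathrm{CONVEX}}(X) = \begin{cases} 1 & \text{if } X \text{ is a convex figure,} \\ 0 & \text{if } X \text{ is not a convex figure;} \end{cases}$$

$$\psi_{\mathrm{CONNECTED}}(X) = \begin{cases} 1 & \text{if } X \text{ is a connected figure,} \\ 0 & \text{otherwise.} \end{cases}$$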









We will also use some very much simpler predicates.* The very 
simplest predicate “recognizes” when a particular single point is 
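in X: let p be a point in the plane and define

$$\varphi_p(X) = \begin{cases} 1 & \text{if } p \text{ is in } X, \\ 0 & \text{otherwise.} \end{cases}$$

Finally we will need the kind of predicate that tells when a particular
set A is a subset of X:

$$\varphi_A(X) = \begin{cases} 1 & \text{if } A \subseteq X, \\ 0 & \text{otherwise.} \end{cases}$$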
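
These two families of simple predicates are easy to state concretely. The following is a small sketch, assuming a figure X is represented as a frozenset of integer grid points; the function names `phi_point` and `phi_subset` are chosen here for illustration only.

```python
from typing import Callable, FrozenSet, Iterable, Tuple

Point = Tuple[int, int]
Figure = FrozenSet[Point]


def phi_point(p: Point) -> Callable[[Figure], int]:
    """Predicate that recognizes whether the single point p is in X."""
    return lambda X: 1 if p in X else 0


def phi_subset(A: Iterable[Point]) -> Callable[[Figure], int]:
    """Predicate that recognizes whether every point of A is in X."""
    fixed = frozenset(A)
    return lambda X: 1 if fixed <= X else 0


X = frozenset({(0, 0), (1, 0), (1, 1)})
in_x = phi_point((1, 1))(X)                   # p is a point of X
covered = phi_subset([(0, 0), (1, 0)])(X)     # A is a subset of X
not_covered = phi_subset([(0, 0), (5, 5)])(X)  # (5, 5) is missing from X
```

Note that the point predicate is the special case of the subset predicate in which A has exactly one element.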



0.6 One simple concept of “Local” 

We start by observing an important difference between $\psi_{\mathrm{CONNECTED}}$
and $\psi_{\mathrm{CONVEX}}$. To bring it out we state a fact about convexity:

Definition: A set X fails to be convex if and only if there exist
three points p, q, r such that q is in the line segment joining p and r, and

    p is in X,
    q is not in X,
    r is in X.

Thus we can test for convexity by examining triplets of points. If 
all the triplets pass the test then X is convex; if any triplet fails 
(that is, meets all conditions above) then X is not convex. Be- 
cause all the tests can be done independently, and the final decision 
made by such a logically simple procedure— unanimity of all the 
tests — we propose this as a first draft of our definition of “local.” 


*We will use “$\varphi$” instead of “$\psi$” for those very simple predicates that will be
combined later to make more complicated predicates. No absolute logical distinction
is implied.






Definition: A predicate $\psi$ is conjunctively local of order $k$ if it can
be computed, as in §0.4, by a set $\Phi$ of predicates $\varphi$ such that

    each $\varphi$ depends upon no more than $k$ points of R;

    $$\psi(X) = \begin{cases} 1 & \text{if } \varphi(X) = 1 \text{ for every } \varphi \text{ in } \Phi, \\ 0 & \text{otherwise.} \end{cases}$$

Example: $\psi_{\mathrm{CONVEX}}$ is conjunctively local of order 3.
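
The triplet test for convexity can be sketched in code. To keep the condition "q lies on the segment joining p and r" trivial to check, the sketch below assumes a one-dimensional retina of positions 0, 1, ..., n-1, where "convex" simply means "an unbroken interval"; this simplification and all function names are assumptions made for illustration, not part of the text.

```python
from itertools import combinations
from typing import Callable, FrozenSet, List

Figure = FrozenSet[int]  # occupied positions on a one-dimensional retina


def triplet_test(p: int, q: int, r: int) -> Callable[[Figure], int]:
    """Order-3 predicate for p < q < r: 0 only if p and r are in X but q is not."""
    return lambda X: 0 if (p in X and r in X and q not in X) else 1


def all_triplet_tests(n: int) -> List[Callable[[Figure], int]]:
    """One independent test per triple of positions 0 <= p < q < r < n."""
    return [triplet_test(p, q, r) for p, q, r in combinations(range(n), 3)]


def conjunctively_local(phis: List[Callable[[Figure], int]], X: Figure) -> int:
    """Unanimity rule: 1 if every phi reports 1 on X, else 0."""
    return 1 if all(phi(X) == 1 for phi in phis) else 0


PHIS = all_triplet_tests(6)  # 20 tests, each looking at only 3 points
interval = conjunctively_local(PHIS, frozenset({1, 2, 3}))  # no gap
gapped = conjunctively_local(PHIS, frozenset({1, 3}))       # gap at position 2
```

Each test examines three points and knows nothing of the others, yet their unanimous verdict decides the global property.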


The property of a figure being connected might not seem at first 
to be very different in kind from the property of being convex. 
Yet we can show that: 


Theorem 0.6.1: $\psi_{\mathrm{CONNECTED}}$ is not conjunctively local of any order.


Proof: Suppose that $\psi_{\mathrm{CONNECTED}}$ has order $k$. Then to distinguish
between the figures

[Figure: $X_0$]

and

[Figure: $X_1$]

there must be some $\varphi$ such that $\varphi(X_0) = 0$, because $X_0$ is not
connected. All $\varphi$'s have value 1 on $X_1$, which is connected. Now, $\varphi$
can depend on at most $k$ points, so there must be at least one
middle square that does not contain any of these points.
But then, on the figure $X_2$,

[Figure: $X_2$]

which is connected, $\varphi$ must have the same value, 0, that it has on
$X_0$. But this cannot be, for all $\varphi$'s must have value 1 on $X_2$.

Of course, if some $\varphi$ is allowed to look at all the points of R then
$\psi_{\mathrm{CONNECTED}}$ can be computed, but this would go against any concept
of the $\varphi$'s as “local” functions.





0.7 Some Other Concepts of Local 

We have accumulated some evidence in favor of “conjunctively 
local” as a geometrical and computationally meaningful property 
of predicates. But a closer look raises doubts about whether it is 
broad enough to lead to a rich enough theory. 

Readers acquainted with the mathematical methods of topology 
will have observed that “conjunctively local” is similar to the 
notion of “local property” in topology. However, if we were to 
pursue the analogy, we would restrict the $\varphi$'s to depend upon all
the points inside small circles rather than upon fixed numbers of 
points. Accordingly, we will follow two parallel paths. One is 
based on restrictions on numbers of points and in this case we shall 
talk of predicates of limited order. The other is based on restric- 
tions of distances between the points, and here we shall talk of 
diameter-limited predicates. Despite the analogy with other im- 
portant situations, the concept of local based on diameter limita- 
tions seems to be less interesting in our theory — although one 
might have expected quite the opposite. 

More serious doubts arise from the narrowness of the “conjunc- 
tive” or “unanimity” requirement. As a next step toward ex- 
tending our concept of local , let us now try to separate essential 
from arbitrary features of the definition of conjunctive localness. 
The intention of the definition was to divide the computation of a 
predicate $\psi$ into two stages:

Stage I: 

The computation of many properties or features $\varphi_\alpha$ which are each
easy to compute, either because each depends only on a small part of 
the whole input space R, or because they are very simple in some 
other interesting way. 

Stage II: 

A decision algorithm $\Omega$ that defines $\psi$ by combining the results of the
Stage I computations. For the division into two stages to be mean- 
ingful, this decision function must also be distinctively homogeneous, 
or easy to program, or easy to compute. 

The particular way this intention was realized in our example
$\psi_{\mathrm{CONVEX}}$ was rather arbitrary. In Stage I we made sure that the
$\varphi$'s were easy to compute by requiring each to depend only upon
a few points of R. In Stage II we used just about the simplest imaginable
decision rule: if the $\varphi$'s are unanimous we accept the
figure; we reject it if even a single $\varphi$ disagrees.

We would prefer to be able to present a perfectly precise defini- 
tion of our intuitive local-vs.-global concept. One trouble is that 
phrases like “easy-to-compute” keep recurring in our attempt to 
formulate it. To make this precise would require some scheme for 
comparing the complexity of different computation procedures. 
Until we find an intuitively satisfactory scheme for this, and it 
doesn’t seem to be around the corner, the requirements of both 
Stage I and Stage II will retain the heuristic character that makes 
formal definition difficult. 

From this point on, we will concentrate our attention on a partic- 
ular scheme for Stage II — “weighted voting,” or “linear combina- 
tion” of the predicates of Stage I. This is the so-called perceptron 
scheme, and we proceed next to give our final definition. 

0.8 Perceptrons 

Let $\Phi = \{\varphi_1, \varphi_2, \ldots, \varphi_n\}$ be a family of predicates. We will say that
$\psi$ is linear with respect to $\Phi$
if there exists a number $\theta$ and a set of numbers $\{\alpha_{\varphi_1}, \alpha_{\varphi_2}, \ldots, \alpha_{\varphi_n}\}$
such that $\psi(X) = 1$ if and only if $\alpha_{\varphi_1}\varphi_1(X) + \cdots + \alpha_{\varphi_n}\varphi_n(X) > \theta$.
The number $\theta$ is called the threshold and the $\alpha$'s are called the coefficients
or weights. (See Figure 0.2.) We usually write more compactly

$$\psi(X) = 1 \text{ if and only if } \sum_{\varphi \in \Phi} \alpha_\varphi \varphi(X) > \theta.$$
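
The definition translates almost word for word into code. The sketch below is only an illustration under stated assumptions: figures are frozensets of labeled points, and `linear_predicate`, `contains`, and the "at least two of three points" example are names and choices made here, not taken from the text.

```python
from typing import Callable, FrozenSet, Sequence

Figure = FrozenSet[str]
Predicate = Callable[[Figure], int]


def linear_predicate(phis: Sequence[Predicate],
                     alphas: Sequence[float],
                     theta: float) -> Predicate:
    """Build psi with psi(X) = 1 iff sum(alpha_phi * phi(X)) > theta."""
    if len(phis) != len(alphas):
        raise ValueError("need exactly one weight per partial predicate")

    def psi(X: Figure) -> int:
        total = sum(a * phi(X) for a, phi in zip(alphas, phis))
        return 1 if total > theta else 0

    return psi


def contains(p: str) -> Predicate:
    """Partial predicate of order 1: is the point labeled p in X?"""
    return lambda X: 1 if p in X else 0


# "At least two of the points a, b, c are in X": weights 1, 1, 1 and threshold 1.
at_least_two = linear_predicate(
    [contains("a"), contains("b"), contains("c")], [1, 1, 1], 1)
```

The strict inequality matters: with threshold 1, a figure containing exactly one of the three points gives a sum of 1, which does not exceed the threshold.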



Figure 0.2 





The intuitive idea is that each predicate of $\Phi$ is supposed to provide
some evidence about whether $\psi$ is true for any figure X. If, on
the whole, $\psi(X)$ is strongly correlated with $\varphi(X)$ one expects $\alpha_\varphi$
to be positive, while if the correlation is negative so would be $\alpha_\varphi$.
The idea of correlation should not be taken literally here, but only
as a suggestive analogy.

Example: Any conjunctively local predicate can be expressed in
this form by choosing $\theta = -1$ and $\alpha_\varphi = -1$ for every $\varphi$. For then

$$\sum_{\varphi \in \Phi} (-1)\,\varphi(X) > -1, \quad \text{or, as one can also write,} \quad \sum_{\varphi \in \Phi} \varphi(X) \le 0,$$

exactly when $\varphi(X) = 0$ for every $\varphi$ in $\Phi$. (The senses of true and
false thus have to be reversed for the $\varphi$'s, but this isn't important.)
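
A short sketch may make the sign reversal concrete. It assumes, purely for illustration, a one-dimensional retina of five positions and partial predicates that report 1 when a local test fails (the reversed sense mentioned in the example); the names `violation` and `psi_interval` are hypothetical.

```python
from itertools import combinations
from typing import Callable, FrozenSet

Figure = FrozenSet[int]


def violation(p: int, q: int, r: int) -> Callable[[Figure], int]:
    """Reversed-sense predicate: 1 when p and r are in X but q (between them) is not."""
    return lambda X: 1 if (p in X and r in X and q not in X) else 0


PHIS = [violation(p, q, r) for p, q, r in combinations(range(5), 3)]
ALPHA = -1   # the same weight for every phi
THETA = -1   # threshold


def psi_interval(X: Figure) -> int:
    """1 iff the weighted sum exceeds THETA, i.e. iff no phi reports a violation."""
    total = sum(ALPHA * phi(X) for phi in PHIS)
    return 1 if total > THETA else 0
```

Since every term of the sum is 0 or -1, the sum exceeds -1 only when it is exactly 0, so a single violation is enough to reject the figure.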

Example: Consider the seesaw of Figure 0.3 and let X be an arrangement
of pebbles placed at some of the equally spaced points
$\{p_1, \ldots, p_7\}$. Then R has seven points. Define $\varphi_i(X) = 1$ if and
only if X contains a pebble at the $i$th point. Then we can express
the predicate

“The seesaw will tip to the right”

by the formula

$$\sum_i (i - 4)\,\varphi_i(X) > 0,$$

where $\theta = 0$ and $\alpha_i = (i - 4)$.
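
The seesaw formula is small enough to evaluate directly. In this sketch an arrangement X is represented as a set of occupied positions 1 through 7 (an assumed encoding), and `tips_right` is a name chosen here.

```python
from typing import FrozenSet

Arrangement = FrozenSet[int]  # occupied positions among 1..7; position 4 is the pivot


def torque(X: Arrangement) -> int:
    """Weighted sum of the seesaw formula: sum over i of (i - 4) * phi_i(X)."""
    return sum((i - 4) * (1 if i in X else 0) for i in range(1, 8))


def tips_right(X: Arrangement) -> bool:
    """The predicate holds exactly when the weighted sum exceeds the threshold 0."""
    return torque(X) > 0


balanced = torque(frozenset({1, 7}))      # -3 + 3
heavy_right = torque(frozenset({1, 6, 7}))  # -3 + 2 + 3
```

A pebble at the pivot (position 4) has weight 0 and never affects the decision, and a balanced arrangement does not count as tipping to the right because the inequality is strict.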



Figure 0.3 


There are a number of problems concerning the possibility of infinite 
sums and such matters when we apply this concept to recognizing pat- 
terns in the Euclidean plane. These issues are discussed extensively 
in the text, and we want here only to reassure the mathematician that 
the problem will be faced. Except when there is a good technical reason 
to use infinite sums (and this is sometimes the case) we will make the 
problem finite by two general methods. One is to treat the retina R as 





made up of discrete little squares (instead of points) and treat as equiva- 
lent figures that intersect the same squares. The other is to consider 
only bounded X's and choose $\Phi$ so that for any bounded X only a finite
number of $\varphi$'s will be nonzero.

Definition: A perceptron is a device capable of computing all
predicates which are linear in some given set $\Phi$ of partial predicates.

That is, we are given a set of $\varphi$'s, but can select freely their
“weights,” the $\alpha_\varphi$'s, and also the threshold $\theta$. For reasons that
will become clear as we proceed, there is little to say about all 
perceptrons in general. But, by imposing certain conditions and 
restrictions we will find much to say about certain particularly 
interesting families of perceptrons. Among these families are 

1. Diameter-limited perceptrons: for each $\varphi$ in $\Phi$, the set of points
upon which $\varphi$ depends is restricted not to exceed a certain fixed
diameter in the plane.

2. Order-restricted perceptrons: we say that a perceptron has
order $\le n$ if no member of $\Phi$ depends on more than $n$ points.

3. Gamba perceptrons: each member of $\Phi$ may depend on all the
points but must be a “linear threshold function” (that is, each
member of $\Phi$ is itself computed by a perceptron of order 1, as
defined in 2 above).

4. Random perceptrons: These are the form most extensively
studied by Rosenblatt’s group: the $\varphi$'s are random Boolean functions.
That is to say, they are order-restricted and $\Phi$ is generated
by a stochastic process according to an assigned distribution function.

5. Bounded perceptrons: $\Phi$ contains an infinite number of $\varphi$'s,
but all the $\alpha_\varphi$ lie in a finite set of numbers.

To give a preview of the kind of results we will obtain, we present
here a simple example of a theorem about diameter-limited perceptrons.

Theorem 0.8: No diameter-limited perceptron can determine
whether or not all the parts of any geometric figure are connected
to one another! That is, no such perceptron computes $\psi_{\mathrm{CONNECTED}}$.





The proof requires us to consider just four figures

[Figure: $X_{00}$, $X_{01}$, $X_{10}$, $X_{11}$]

and a diameter-limited perceptron $\psi$ whose support sets have
diameters like those indicated by the circles below:



It is understood that the diameter in question is given at the start,
and we then choose the X's to be several diameters in length.
Suppose that such a perceptron could distinguish disconnected
figures (like $X_{00}$ and $X_{11}$) from connected figures (like $X_{10}$ and
$X_{01}$), according to whether or not

$$\sum_{\varphi} \alpha_\varphi \varphi > \theta,$$

that is, according to whether or not

$$\sum_{\text{group 1}} \alpha_\varphi \varphi(X) + \sum_{\text{group 2}} \alpha_\varphi \varphi(X) + \sum_{\text{group 3}} \alpha_\varphi \varphi(X) > \theta,$$

where we have grouped the $\varphi$'s according to whether their support
sets lie near the left, right, or neither end of the figures. Then for
$X_{00}$ the total sum must be negative (measured relative to $\theta$). In changing to $X_{10}$ only
$\sum_{\text{group 1}}$ is affected, and its value must increase enough to make the
total sum become positive. If we were instead to change $X_{00}$ to
$X_{01}$ then $\sum_{\text{group 2}}$ would have to increase. But if we were to change
$X_{00}$ to $X_{11}$, both $\sum_{\text{group 1}}$ and $\sum_{\text{group 2}}$ will have to increase by these
same amounts since (locally!) the same changes are seen by the
group 1 and group 2 predicates, while $\sum_{\text{group 3}}$ is unchanged in
every case. Hence, the net change in the $X_{00} \to X_{11}$ case must be even
more positive, so that if the perceptron is to make the correct
decision for $X_{00}$, $X_{01}$, and $X_{10}$, it is forced to accept $X_{11}$ as connected,
and this is an error! So no such perceptron can exist.
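
The bookkeeping behind this argument can be written out explicitly. In the sketch below, $S_i(X)$ denotes the sum over group $i$, $S(X)$ the total, and $\Delta_1$, $\Delta_2$ the changes in the first two group sums; these symbols are introduced here for illustration and are not the authors' notation.

```latex
\begin{align*}
S(X) &= S_1(X) + S_2(X) + S_3(X),
  \qquad S_i(X) = \sum_{\text{group } i} \alpha_\varphi \varphi(X), \\
\Delta_1 &= S_1(X_{10}) - S_1(X_{00}) = S_1(X_{11}) - S_1(X_{01}), \\
\Delta_2 &= S_2(X_{01}) - S_2(X_{00}) = S_2(X_{11}) - S_2(X_{10}), \\
S(X_{10}) &= S(X_{00}) + \Delta_1, \qquad
S(X_{01}) = S(X_{00}) + \Delta_2, \qquad
S(X_{11}) = S(X_{00}) + \Delta_1 + \Delta_2, \\
S(X_{11}) &= S(X_{10}) + S(X_{01}) - S(X_{00})
  \;>\; \theta + \theta - \theta \;=\; \theta .
\end{align*}
```

The last line uses the three correct decisions assumed in the proof: $S(X_{10}) > \theta$ and $S(X_{01}) > \theta$ because those figures are connected, and $S(X_{00}) \le \theta$ because that figure is not. The conclusion $S(X_{11}) > \theta$ is exactly the forced error.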

Readers already familiar with perceptrons will note that this proof — 
which shows that diameter-limited perceptrons cannot recognize con- 
nectedness — is concerned neither with “learning” nor with probability 
theory (or even with the geometry of hyperplanes in $n$-dimensional hyperspace).
It is entirely a matter of relating the geometry of the patterns to
the algebra of weighted predicates. Readers concerned with physiology 
will note that— insofar as the presently identified functions of receptor 
cells are all diameter-limited— this suggests that an animal will require 
more than neurosynaptic “summation” effects to make these cells com- 
pute connectedness. Indeed, only the most advanced animals can appre- 
hend this complicated visual concept. In Chapter 5 this theorem is 
shown to extend also to order-limited perceptrons. 

0.9 Seductive Aspects of Perceptrons 

The purest vision of the perceptron as a pattern-recognizing 
device is the following: 

The machine is built with a fixed set of computing elements for the partial 
functions $\varphi$, usually obtained by a random process. To make it recognize
a particular pattern (set of input figures) one merely has to set the coefficients
to suitable values. Thus “programming” takes on a pleasingly
homogeneous form. Moreover since “programs” are representable as
points $(\alpha_1, \ldots, \alpha_n)$ in an $n$-dimensional space, they inherit a metric

which makes it easy to imagine a kind of automatic programming which 
people have been tempted to call learning : by attaching feedback devices 
to the parameter controls they propose to “program” the machine by 
providing it with a sequence of input patterns and an “error signal” 
which will cause the coefficients to change in the right direction when 
the machine makes an inappropriate decision. The perceptron convergence 
theorems (see Chapter 11) define conditions under which this procedure
is guaranteed to find, eventually, a correct set of values. 
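
The feedback procedure described in this passage can be sketched as follows. The passage does not state an update rule, so the code assumes the familiar error-correction rule (on a wrong decision, move each coefficient by the value of its partial function and shift the threshold the opposite way); the names `decide` and `train`, the cap on passes, and the toy task are all assumptions made for illustration.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

Figure = FrozenSet[int]
Predicate = Callable[[Figure], int]


def decide(alphas: Sequence[float], theta: float, features: Sequence[int]) -> int:
    """Perceptron decision: 1 iff the weighted sum of features exceeds theta."""
    return 1 if sum(a * f for a, f in zip(alphas, features)) > theta else 0


def train(phis: Sequence[Predicate],
          examples: Sequence[Tuple[Figure, int]],
          max_passes: int = 100) -> Tuple[List[float], float, bool]:
    """Error-correction training; returns (alphas, theta, converged)."""
    alphas = [0.0] * len(phis)
    theta = 0.0
    for _ in range(max_passes):
        mistakes = 0
        for X, target in examples:
            features = [phi(X) for phi in phis]
            if decide(alphas, theta, features) != target:
                mistakes += 1
                step = 1.0 if target == 1 else -1.0  # the "error signal"
                alphas = [a + step * f for a, f in zip(alphas, features)]
                theta -= step
        if mistakes == 0:  # a full pass without error: a correct setting was found
            return alphas, theta, True
    return alphas, theta, False


# Toy task on a two-point retina: "X is nonempty".
PHIS = [lambda X: 1 if 0 in X else 0, lambda X: 1 if 1 in X else 0]
EXAMPLES = [(frozenset(), 0), (frozenset({0}), 1),
            (frozenset({1}), 1), (frozenset({0, 1}), 1)]
alphas, theta, converged = train(PHIS, EXAMPLES)
```

The procedure can only succeed when some setting of the coefficients is correct for the given partial functions; the cap on passes is there because otherwise the loop need not terminate.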

0.9.1 Homogeneous Programming and Learning 

To separate reality from wishful thinking, we begin by making a 
number of observations. Let $\Phi$ be the set of partial predicates of
a perceptron and $L(\Phi)$ the set of all predicates linear in $\Phi$. Thus





$L(\Phi)$ is the repertoire of the perceptron, that is, the set of predicates it
can compute when its coefficients $\alpha_\varphi$ and threshold $\theta$ range over
all possible values. Of course $L(\Phi)$ could in principle be the set of
all predicates but this is impossible in practice, since $\Phi$ would
have to be astronomically large. So any physically real perceptron
has a limited repertoire. The ease and uniformity of programming 
have been bought at a cost! We contend that the traditional investi- 
gations of perceptrons did not realistically measure this cost. In 
particular they neglect the following crucial points: 

1. The idea of thinking of classes of geometrical objects (or programs
that define or recognize them) as classes of $n$-dimensional
vectors $(\alpha_1, \ldots, \alpha_n)$ loses the geometric individuality of the
patterns and leads only to a theory that can do little more than
count the number of predicates in $L(\Phi)$! This kind of imagery has
become traditional among those who think about pattern recognition
along lines suggested by classical statistical theories. As a
result not many people seem to have observed or suspected that
there might be particular meaningful and intuitively simple predicates
that belong to no practically realizable set $L(\Phi)$. We will
extend our analysis of $\psi_{\mathrm{CONNECTED}}$ to show how deep this problem
can be. At the same time we will show that certain predicates
which might intuitively seem to be difficult for these devices can,
in fact, be recognized by low-order perceptrons: $\psi_{\mathrm{CONVEX}}$ already
illustrates this possibility.

2. Little attention has been paid to the size, or more precisely,
the information content, of the parameters $\alpha_1, \ldots, \alpha_n$. We will
give examples (which we believe are typical rather than exceptional)
where the ratio of the largest to the smallest of the coefficients
is meaninglessly big. Under such conditions it is of no
(practical) avail that a predicate be in $L(\Phi)$. In some cases the
information capacity needed to store $\alpha_1, \ldots, \alpha_n$ is even greater
than that needed to store the whole class of figures defined by the
pattern!

3. Closely related to the previous point is the problem of time of 
convergence in a “learning” process. Practical perceptrons are es- 
sentially finite-state devices (as shown in Chapter 11). It is there- 
fore vacuous to cite a “perceptron convergence theorem” as 
assurance that a learning process will eventually find a correct 





setting of its parameters (if one exists). For it could do so trivially 
by cycling through all its states, that is, by trying all coefficient 
assignments. The significant question is how fast the perceptron 
learns relative to the time taken by a completely random pro- 
cedure, or a completely exhaustive procedure. It will be seen that 
there are situations of some geometric interest for which the con- 
vergence time can be shown to increase even faster than ex- 
ponentially with the size of the set R. 

Perceptron theorists are not alone in neglecting these precautions. 
A perusal of any typical collection of papers on “self-organizing” 
systems will provide a generous sample of discussions of “learn- 
ing” or “adaptive” machines that lack even the degree of rigor 
and formal definition to be found in the literature on perceptrons. 
The proponents of these schemes seldom provide any analysis of 
the range of behavior which can be learned nor do they show 
much awareness of the price usually paid to make some kinds of 
learning easy: they unwittingly restrict the device’s total range of 
behavior with hidden assumptions about the environment in 
which it is to operate. 

These critical remarks must not be read as suggestions that we are 
opposed to making machines that can “learn.” Exactly the con- 
trary! But we do believe that significant learning at a significant 
rate presupposes some significant prior structure. Simple learning 
schemes based on adjusting coefficients can indeed be practical 
and valuable when the partial functions are reasonably matched 
to the task, as they are in Samuel’s checker player. A perceptron 
whose $\varphi$'s are properly designed for a discrimination known to be
of suitably low order will have a good chance to improve its 
performance adaptively. Our purpose is to explain why there is 
little chance of much good coming from giving a high-order prob- 
lem to a quasi-universal perceptron whose partial functions have 
not been chosen with any particular task in mind. 

It may be argued that people are universal learning machines and so a 
counterexample to this thesis. But our brains are sufficiently structured 
to be programmable in a much more general sense than the perceptron 
and our culture is sufficiently structured to provide, if not actual pro- 
gram, at least a rather complex set of interactions that govern the course 
of whatever the process of self-programming may be. Moreover, it takes 
time for us to become universal learners: the sequence of transitions 
from infancy to intellectual maturity seems rather a confirmation of the 





thesis that the rate of acquisition of new cognitive structure (that is, 
learning) is a sensitive function of the level of existing cognitive structure. 

0.9.2 Parallel Computation 

The perceptron was conceived as a parallel-operation device in 
the physical sense that the partial predicates are computed simul- 
taneously. (From a formal point of view the important aspect is 
that they are computed independently of one another.) The price 
paid for this is that all the $\varphi$'s must be computed, although only
a minute fraction of them may in fact be relevant to any particular
final decision. The total amount of computation may become
vastly greater than that which would have to be carried out in a
well planned sequential process (using the same $\varphi$'s) whose
decisions about what next to compute are conditional on the out- 
come of earlier computation. Thus the choice between parallel 
and serial methods in any particular situation must be based on 
balancing the increased value of reducing the (total elapsed) time 
against the cost of the additional computation involved. 

Even low-order predicates may require large amounts of wasteful com- 
putation of information which would be irrelevant to a serial process. 
This cost may sometimes remain within physically realizable bounds, 
especially if a large tolerance (or “blur”) is acceptable. High-order 
predicates usually create a completely different situation. An instructive 
example is provided by $\psi_{\mathrm{CONNECTED}}$. As shown in Chapter 5, any perceptron
for this predicate on a 100 × 100 toroidal retina needs partial
functions that each look at many hundreds of points! In this case the 
concept of “local” function is almost irrelevant: the partial functions are 
themselves global. Moreover, the fantastic number of possible partial 
functions with such large supports sheds gloom on any hope that a 
modestly sized, randomly generated set of them would be sufficiently 
dense to span the appropriate space of functions. To make this point 
sharper we shall show that for certain predicates and classes of partial 
functions, the number of partial functions that have to be used (to say 
nothing of the size of their coefficients) would exceed physically realiz- 
able limits. 

The conclusion to be drawn is that the appraisal of any particular 
scheme of parallel computation cannot be undertaken rationally 
without tools to determine the extent to which the problems to be 
solved can be analyzed into local and global components. The 
lack of a general theory of what is global and what is local is no 
excuse for avoiding the problem in particular cases. This study 
will show that it is not impossibly difficult to develop such a 
theory for a limited but important class of problems. 





0.9.3 The Use of Simple Analogue Devices 

Part of the attraction of the perceptron lies in the possibility of 
using very simple physical devices — “analogue computers” — to 
evaluate the linear threshold functions. It is perhaps generally 
appreciated that the utility of this scheme is limited by the sparse- 
ness of linear threshold functions in the set of all logical functions. 
However, almost no attention has been paid to the possibility that 
the set of linear functions which are practically realizable may
be rarer still. To illustrate this problem we shall compute (in
Chapter 10) the range and sizes of the coefficients in the linear 
representations of certain predicates. It will be seen that certain 
ratios can increase faster than exponentially with the number of 
distinguishable points in R. It follows that for “big” input sets— 
say, R's with more than 20 points— no simple analogue storage 
device can be made with enough information capacity to store the 
whole range of coefficients! 

To avoid misunderstanding perhaps we should repeat the quali- 
fications we made in connection with our critique of the percep- 
tron as a model for “learning devices.” We have no doubt that 
analogue devices of this sort have a role to play in pattern 
recognition. But we do not see that any good can come of experi- 
ments which pay no attention to limiting factors that will assert 
themselves as soon as the small model is scaled up to a usable size. 

0.9.4 Models for Brain Function and Gestalt Psychology 

The popularity of the perceptron as a model for an intelligent, 
general-purpose learning machine has roots, we think, in an 
image of the brain itself as a rather loosely organized, randomly 
interconnected network of relatively simple devices. This impres- 
sion in turn derives in part from our first impressions of the be- 
wildering structures seen in the microscopic anatomy of the brain 
(and probably also derives from our still-chaotic ideas about 
psychological mechanisms). 

In any case the image is that of a network of relatively simple 
elements, randomly connected to one another, with provision for 
making adjustments of the ease with which signals can go across 
the connections. When the machine does something bad, we will 
“teach” it not to do it again by weakening the connections that 
were involved; perhaps we will do the opposite to reward it when 
it does something we like. 





The “perceptron” type of machine is one particularly simple 
version of this broader concept; several others have also been 
studied in experiments. 

The mystique surrounding such machines is based in part on the 
idea that when such a machine learns the information stored is 
not localized in any particular spot but is, instead, “distributed 
throughout” the structure of the machine’s network. It was a great 
disappointment, in the first half of the twentieth century, that 
experiments did not support nineteenth century concepts of the 
localization of memories (or most other “faculties”) in highly 
local brain areas. Whatever the precise interpretation of those not 
particularly conclusive experiments should be, there is no ques- 
tion but that they did lead to a search for nonlocal machine- 
function concepts. This search was not notably successful. Several 
schemes were proposed, based upon large-scale fields, or upon 
“interference patterns” in global oscillatory waves, but these 
never led to plausible theories. (Toward the end of that era a more 
intricate and substantially less global concept of “cell-assembly” 
— proposed by D. O. Hebb [1949] — lent itself to more productive 
theorizing; though it has not yet led to any conclusive model, its 
popularity is today very much on the increase.) However, it is not 
our goal here to evaluate these theories, but only to sketch a 
picture of the intellectual stage that was set for the perceptron 
concept. In this setting, Rosenblatt’s [1958] schemes quickly took 
root, and soon there were perhaps as many as a hundred groups, 
large and small, experimenting with the model either as a “learn- 
ing machine” or in the guise of “adaptive” or “self-organizing” 
networks or “automatic control” systems. 

The results of these hundreds of projects and experiments were 
generally disappointing, and the explanations inconclusive. The 
machines usually work quite well on very simple problems but 
deteriorate very rapidly as the tasks assigned to them get harder. 
The situation isn’t usually improved much by increasing the size 
and running time of the system. It was our suspicion that even in 
those instances where some success was apparent, it was usually 
due more to some relatively small part of the network, and not 
really to a global, distributed activity. Both of the present authors 
(first independently and later together) became involved with a 
somewhat therapeutic compulsion: to dispel what we feared to be 





the first shadows of a “holistic” or “Gestalt” misconception that 
would threaten to haunt the fields of engineering and artificial 
intelligence as it had earlier haunted biology and psychology. 
For this, and for a variety of more practical and theoretical goals, 
we set out to find something about the range and limitations of 
perceptrons. 

It was only later, as the theory developed, that we realized that 
understanding this kind of machine was important whether or not 
the system has practical applications in particular situations! For 
the same kinds of problems were becoming serious obstacles to 
the progress of computer science itself. As we have already re- 
marked, we do not know enough about what makes some algo- 
rithmic procedures “essentially” serial, and to what extent — or 
rather, at what cost — can computations be speeded up by using 
multiple, overlapping computations on larger more active 
memories. 

0.10 General Plan of the Book 

The theory divides naturally into three parts. In Part I we explore 
some very general properties of linear predicate families. The 
theorems in Part I apply usually to all perceptrons, independently 
of the kinds of patterns considered; therefore the theory has the 
quality of algebra rather than geometry. In Part II we look more 
narrowly at interesting geometric patterns, and get sharper but, of 
course, less general, theorems about the geometric abilities of our 
machines. In Part III we examine a variety of questions centered 
around the potentialities of perceptrons as practical devices for 
pattern recognition and learning. The final chapter traces some of 
the history of these ideas and proposes some plausible directions 
for further exploration. 


[Handwritten note by the authors, mostly illegible in this copy. The legible
fragments appear to say that the harder sections can be skipped, and mention
§5, §7, and §10.]



ALGEBRAIC THEORY OF LINEAR 
PARALLEL PREDICATES 


I 





Introduction to Part I 

Part I (Chapters 1-4) contains a series of purely algebraic defini- 
tions and general theorems used later in Part II. It will be easier to 
read through this material if one has already a preliminary picture 
of the roles these mathematical devices are destined to play. We 
can give such a picture by outlining how we will prove the following
theorem. [Handwritten addition by the authors: We do not expect the reader
really to absorb this condensed synopsis.]

Theorem 3.1 (Chapter 3) Informal Version: Suppose the retina
R has a finite number of points. Then there is no perceptron
$\sum \alpha_\varphi \varphi(X) > \theta$ that can decide whether or not the “number of
points in X is odd” unless at least one of the $\varphi$'s depends on all
the points of R.

Thus no bound can be placed on the orders of perceptrons that
compute this predicate for arbitrarily large retinas. To realize it a
perceptron has to have at the start at least one $\varphi$ that looks at
the whole picture! The proof uses several steps:

Step 1: In §1.1-§1.4, we define “perceptron,” “order,” etc., more
precisely, and show that certain details of the definitions can be
changed without serious effects.

Step 2: In §1.4 we define the particularly simple functions called
“masks.” For each subset $A$ of the retina define the mask $\varphi_A(X)$
to have value 1 if the figure $X$ contains or “covers” all of $A$, value
0 otherwise. Then we prove the simple but important theorem
(§1.5) that if a predicate has order $\le k$ (see §1.3) in any set of
$\varphi$ functions, then there is an equivalent perceptron that uses only
masks of size $\le k$.

Step 3: To get at the parity (the “odd-even” property) we ask:
What rearrangements of the input space R leave it unaffected?
That is, we ask about the group of transformations of the figure 
that have no effect on the property. This might seem to be an 
exotic way to approach the problem, but since it seems necessary 
for the more difficult problems we attack later, it is good first to 
get used to it in this simple situation. In this case, the group is 
the whole permutation group on R — the set of all rearrangements 
of its points. 

Step 4: In Chapter 2 we show how to use this group to reduce the
perceptron to a simple form. The group-invariance theorem
(proved in §2.2) is used to show that, for the parity perceptron, all
masks with the same support size (that is, all those that look at
the same number of points) can be given identical coefficients.
Let $\beta_j$ be the weight assigned to all masks that have support
size $= j$.



Group-invariant coefficients for the $|R| = 3$ parity predicate.


Step 5: It is then shown (in §3.1) that the parity perceptron can be
written in the form

    $\left\lceil \sum_{j=0}^{k} \beta_j \binom{|X|}{j} > 0 \right\rceil$

where $|X|$ is the number of points in $X$, $k$ is the largest support
size, and $\binom{|X|}{j}$ is the number of subsets of $X$ that have $j$
elements. (In fact, one can use $\beta_j = (-2)^{j-1}$. No smaller numbers
do. See Chapter 10.)

Step 6: Because

    $\binom{n}{j} = \frac{n!}{j!\,(n-j)!} = \frac{1}{j!} \cdot (n+1-1) \cdot (n+1-2) \cdots (n+1-j)$

is a product of $j$ linear terms, it is a polynomial of degree $j$ in $n$.
Therefore we can write our predicate in the form

    $P_k(|X|) > 0,$

where $P_k$ is a polynomial in $|X|$ of algebraic degree $\le k$. Now if
$|X|$ is an odd number, $P_k(|X|) > 0$, while if $|X|$ is even,
$P_k(|X|) \le 0$. Therefore, in the range $0 \le |X| \le |R|$, $P_k$ must
change its direction $|R| - 1$ times. But a polynomial must have
degree $\ge |R|$ to do that, so we conclude that $k \ge |R|$. This
completes the proof exactly as it will be done in §3.1.
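The form used in Steps 5 and 6 can be checked numerically for small
retinas. The sketch below is only an illustration: it assumes the
particular coefficients $\beta_j = (-2)^{j-1}$ (one choice that works; the
function names are ours), and it confirms that the resulting polynomial
in $|X|$ separates odd from even counts when $k = |R|$. That the same
coefficients fail when truncated at $k < |R|$ is merely consistent with
the theorem; it is not a proof of it.

```python
from math import comb


def parity_polynomial(n, k):
    """Sum of beta_j * C(n, j) for j = 0..k, with beta_j = (-2)**(j - 1).

    beta_0 = -1/2 plays the role of the threshold term.
    """
    return sum((-2.0) ** (j - 1) * comb(n, j) for j in range(k + 1))


def represents_parity(size_of_R, k):
    """True if [P_k(|X|) > 0] equals [|X| is odd] for every |X| in 0..|R|."""
    return all(
        (parity_polynomial(n, k) > 0) == (n % 2 == 1)
        for n in range(size_of_R + 1)
    )


full_order_works = represents_parity(5, 5)        # masks of every size allowed
truncated_works = represents_parity(5, 4)         # largest mask omitted
```

With all mask sizes present the polynomial takes the values $-\frac{1}{2}$ and
$+\frac{1}{2}$ alternately, which is the $|R| - 1$ changes of direction
mentioned in Step 6.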



This shows how the algebra works into our theory. For some of 
the more difficult connectedness theorems of Chapter 5, we need 
somewhat more algebra and group theory. In Chapter 4 we push 
the ideas about the geometry of algebraic degrees a little further 
to show that some surprisingly simple predicates require un- 
bounded-order perceptrons. But the results of Chapter 4 are not 
really used later, and the chapter could be skipped on first read- 
ing. 

To see some simpler, but still characteristic results, the reader 
might turn directly to Chapter 8, which is almost self-contained 
because it does not need the algebraic theory. 



Theory of Linear Boolean Inequalities 

1


1.0 

In this chapter we introduce the theory of the linear representa- 
tion of predicates. We will talk about properties of functions 
defined on an abstract set of points, without any additional 
mathematical or geometrical structure. Thus this chapter can be 
regarded as an extension of ordinary Boolean algebra. Later, the 
theorems proved here will be applied to sets with particular geo- 
metrical or topological structures, such as the ordinary Euclidean 
plane. So, we begin by talking about sets in general; only later 
will we deal with properties of familiar things like “triangles.” 

We shall begin with predicates defined for a fixed base space R. In 
§1.1-§1.5, whenever we speak of a predicate we assume an R is already
chosen. Later on, we will be interested in “predicates” defined more 
broadly, either entirely independent of any base space, or on any one of 
some large family of spaces. For example, the predicate 

The set X is nonempty 

can be applied to any space R. The predicate 
The set X is connected 

is meaningful when applied to many different spaces in which there is a 
concept of points being near one another. In §1.6 we will introduce the 
term “predicate scheme” for this more general sense of “predicate.” Our 
main goal is to define the general notion of order of a predicate (§1.5) and 
the notion of finite order of a predicate-scheme (§1.6). In later chapters 
we will use the term predicate loosely to refer also to predicate-schemes, 
and in §1.7 there are some remarks on making these definitions more 
precise and formal. But we do not recommend that readers worry about
this until after the main results are understood intuitively.


1.1 Definitions 

The letter R will denote an arbitrary set of points. We will usually 
use the lower case letters a, b, c, . . . , x, y, z for individual points 
of R and the upper case A, B, C, . . . , X, Y, Z for subsets of R. 
Usually “x” and “X” will be used for “variable” points and
subsets.

We will often be interested in particular “families” of subsets, and
will use small upper-case words for them. Thus CIRCLE is the set of
subsets of R that form complete circles (as in §0.5). For an
abstract family of subsets we will use the bold-face F.



It is natural to associate with any family F of sets a predicate
$\psi_F(X)$ which is true if and only if $X$ is in the family F. For
example $\psi_{CONVEX}(X)$ is true or false according to whether $X$ is or is
not a convex set. Of course, $\psi_{CIRCLE}$ and $\psi_{CONVEX}$ are meaningless
except on nonabstract R's to which these geometric ideas can be
applied. The Greek letters $\varphi$ and $\psi$ will always represent
predicates. $\psi$ will usually denote the predicate of main interest, while
$\varphi$ predicates are usually in a large family of easily computed
functions; the symbol $\Phi$ will refer to that family.

A predicate is a function (of subsets of R) that has two possible
values. Sometimes we think of these two values as “true” and
“false”; other times it is very useful to think of them as “1” and
“0.” Because there is occasionally some danger of confusing these
two kinds of predicate values, we have introduced the notation
$\lceil \psi(X) \rceil$ to avoid ambiguity. The corners always mean that the “1”
and “0” values are to be used. This makes it possible to use the
values of predicates as ordinary numbers, and this is important
in our theory since, as discussed in Chapter 0, we have to combine
evidence obtained from predicates. Any mathematical statement
can be used inside the corners: for example, since 3 is less than 5,
and 1 is less than 2, we can write

    $\lceil 3 < 5 \rceil = 1,$

    $\lceil 3 < 5 \rceil + \lceil 1 < 2 \rceil = 2,$

    $\lceil 3 < 5 \rceil + \lceil 5 < 3 \rceil = 1,$

or even

    $\lceil 3 < \lceil 5 = 1 \rceil \rceil = 0,$

    $4 \cdot \lceil 3 < 5 \rceil + 2 \cdot \lceil 6 < 2 \rceil = 4.$
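The corner notation corresponds to converting a truth value into the
number 1 or 0. As a small sketch (the helper name `corners` is ours, not
the book's), the five sample identities above can be evaluated directly:

```python
def corners(statement):
    """Return 1 if the statement is true and 0 if it is false."""
    return 1 if statement else 0


values = [
    corners(3 < 5),
    corners(3 < 5) + corners(1 < 2),
    corners(3 < 5) + corners(5 < 3),
    corners(3 < corners(5 == 1)),          # inner corners give 0, and 3 < 0 is false
    4 * corners(3 < 5) + 2 * corners(6 < 2),
]
print(values)
```

The fourth entry shows why the corners matter: the inner statement is
first turned into a number, and only then compared with 3.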

It will sometimes be convenient to think of the points of R as
enumerated in a sequence $x_1, x_2, x_3, \ldots, x_i, \ldots$ Then many
predicates can be expressed in terms of the traditional
representations of Boolean algebra. For example the two expressions

    $x_i \vee x_j$        $\lceil x_i \in X \text{ OR } x_j \in X \rceil$

have the same meaning, namely, they have value 1 if either or
both of $x_i$ and $x_j$ are in $X$, value 0 if neither is in $X$. Technically
this means that one thinks of a subset $X$ of R as an assignment
of values 1 or 0 to the $x_i$'s according to whether the $i$th point is in
$X$, so “$x_i$” is used ambiguously both to denote the $i$th point and
to denote the set-function $\lceil x_i \in X \rceil$. We can exploit this by writing
predicates in arithmetic instead of logical forms, that is,

    $\lceil x_1 + x_2 + x_3 > 0 \rceil$ instead of $x_1 \vee x_2 \vee x_3$

or even

    $\lceil 2 x_1 x_2 - x_1 - x_2 > -1 \rceil$ instead of $x_1 \equiv x_2$,

where $x_1 \equiv x_2$ is the predicate that is true if both, or neither, but
not just one of $x_1$ and $x_2$, are in $X$.
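Both arithmetic forms can be compared with their logical counterparts
over every assignment of 0 and 1. The following is a minimal check,
with function names of our own choosing:

```python
from itertools import product


def corners(statement):
    """Return 1 if the statement is true and 0 if it is false."""
    return 1 if statement else 0


def or_arithmetic(x1, x2, x3):
    """Arithmetic form of the disjunction of three points."""
    return corners(x1 + x2 + x3 > 0)


def equiv_arithmetic(x1, x2):
    """Arithmetic form of 'both or neither of x1, x2 are in X'."""
    return corners(2 * x1 * x2 - x1 - x2 > -1)


or_matches = all(
    or_arithmetic(a, b, c) == corners(a or b or c)
    for a, b, c in product((0, 1), repeat=3)
)
equiv_matches = all(
    equiv_arithmetic(a, b) == corners(a == b)
    for a, b in product((0, 1), repeat=2)
)
```

Note that the second form contains the product $x_1 x_2$, so it is not a
linear expression in the one-point predicates alone.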

We will need to be able to express the idea that a function may
“depend” only on a certain subset of R. We denote by $S(\varphi)$ the
subset of R upon which $\varphi$ “really depends”: technically, $S(\varphi)$ is
the smallest subset $S$ of R such that, for every subset $X$ of R,

    $\varphi(X) = \varphi(X \cap S),$

where “$X \cap S$” is the intersection, that is, the set of points that
are in both, of $X$ and $S$. We call $S(\varphi)$ the support of $\varphi$.
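For a small finite R the support can be found by exhaustive search. The
sketch below rests on an observation that is ours rather than the
text's: for finite R, a point belongs to the support exactly when
adding or removing it changes the value of the predicate for at least
one figure. The names `subsets` and `support` are illustrative.

```python
from itertools import combinations


def subsets(points):
    """Yield every subset of the given points as a frozenset."""
    points = list(points)
    for size in range(len(points) + 1):
        for combo in combinations(points, size):
            yield frozenset(combo)


def support(predicate, R):
    """Points of R whose presence or absence can change the predicate."""
    R = frozenset(R)
    return frozenset(
        p for p in R
        if any(predicate(X) != predicate(X ^ {p}) for X in subsets(R))
    )


R = frozenset("abc")
covers_ab = lambda X: int({"a", "b"} <= X)   # looks only at a and b
nonempty = lambda X: int(len(X) > 0)         # looks at every point

S = support(covers_ab, R)
depends_only_on_S = all(covers_ab(X) == covers_ab(X & S) for X in subsets(R))
```

Here `covers_ab` has support of size 2, while `nonempty` has the whole of
R as its support.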

For an infinite space R, some interesting predicates will have $S(\varphi)$
undefined. Consider for example

    $\varphi(X) = \lceil X \text{ contains an infinite set of points} \rceil.$

One could determine whether $\varphi(X)$ is true by examining the points of $X$
that lie in any set $S$ that contains all but a finite number of points of R.
And there is no “smallest” such set!

1.2 Functions Linear with respect to a Class of Predicates 

Let $\Phi$ be a family of predicates defined on R. We say that

    $\psi$ is a linear threshold function with respect to $\Phi$,

if there exists a number $\theta$ and a set of numbers $\alpha(\varphi)$, one for each
$\varphi$ in $\Phi$, such that

    $\psi(X) = \left\lceil \sum_{\varphi \in \Phi} \alpha(\varphi) \cdot \varphi(X) > \theta \right\rceil.$

That is, $\psi(X)$ is true exactly when the inequality within the $\lceil\ \rceil$'s is
true. We often write this less formally as

    $\psi = \lceil \Sigma \alpha(\varphi)\varphi > \theta \rceil$, or even as $\psi = \lceil \Sigma \alpha_\varphi \varphi > \theta \rceil$.

For symmetry, we want to include with $\psi$ its negation

    $\bar{\psi}(X) = \lceil \Sigma \alpha(\varphi)\varphi \le \theta \rceil$

in the class of linear threshold functions. For a given $\Phi$, we use
$L(\Phi)$ to denote the set of all predicates that can be defined in this
way, that is, by choosing different numbers for $\theta$ and for the $\alpha$'s.

For a two-point space $R = \{x, y\}$, the class $L(\{x, y\})$ of functions linear
in the two one-point predicates includes 14 of the $16 = 2^{2^2}$ possible
Boolean functions. For larger numbers of points, the fraction of
functions linear in the one-point predicates decreases very rapidly toward
zero.
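The count of 14 can be reproduced by enumeration. The sketch below
assumes that small integer coefficients and thresholds, here the range
-2 to 2, are enough to reach every linear threshold function of two
points; the names are ours.

```python
from itertools import product


def threshold_truth_table(alpha, beta, theta):
    """Values of [alpha*x + beta*y > theta] at (0,0), (0,1), (1,0), (1,1)."""
    return tuple(
        int(alpha * x + beta * y > theta)
        for x, y in product((0, 1), repeat=2)
    )


coefficients = range(-2, 3)   # assumed search range
linear_tables = {
    threshold_truth_table(a, b, t)
    for a, b, t in product(coefficients, repeat=3)
}
all_tables = set(product((0, 1), repeat=4))
missing = sorted(all_tables - linear_tables)
print(len(linear_tables), missing)
```

The two truth tables that never appear are those of exclusive-or and of
its complement.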


1.2.1 Other possible definitions of $L(\Phi)$

Because the definition of $L(\Phi)$ is so central to what follows, it is
worth examining it to see which features are essential and which
are arbitrary. The following proposition will mention a number of
ways the definition could be changed without significantly altering
its character. In fact, for finite R, the most important case, all the
proposed alternatives lead to strictly equivalent definitions. In the
case of infinite R-spaces, some of them lead to different meanings
for $L(\Phi)$ but never in a way that will affect any of our subsequent
discussions.

Proposition: The following modifications in the formal definition
of $L(\Phi)$ result in defining the same classes of predicates.

(1) If $\Phi$ is assumed to contain the constant function, $I(X) = 1$, then
$\theta$ can be taken to be zero.

(2) The inequality sign “$>$” can be replaced by “$<$,” “$\ge$,” or “$\le$.”

(3) If R is finite then all the $\alpha(\varphi)$'s, and $\theta$, can be confined to integer
values.

(4) All the alternatives in 1-3 can be chosen independently.

These assertions are all obviously true; the following proofs are
intended mainly to help readers who would like some practice in
using our notations.

proof of (1): Define $\alpha'(I) = \alpha(I) - \theta$ and otherwise $\alpha'(\varphi) =
\alpha(\varphi)$. Then

    $\lceil \Sigma \alpha(\varphi)\varphi(X) > \theta \rceil = \lceil \Sigma \alpha'(\varphi)\varphi(X) > 0 \rceil.$

proof of (2): Let $\alpha'(\varphi) = -\alpha(\varphi)$ and $\theta' = -\theta$. Then

    $\lceil \Sigma \alpha(\varphi)\varphi < \theta \rceil = \lceil \Sigma \alpha'(\varphi)\varphi > \theta' \rceil.$

The other replacements follow by exchanging all predicates and
their negations.

proof of (3): If R is finite then $\Phi$ is finite and we can assume that
there is no $X$ for which

    $\Sigma \alpha(\varphi)\varphi(X) = \theta.$

For, if there is such an $X$ we can remedy this by changing $\theta$ to
$\theta + \delta$, where $\delta$ is less than the smallest nonzero value of
$|\Sigma \alpha(\varphi)\varphi(X) - \theta|$. Suppose first that all the $\alpha(\varphi)$'s are
rational. Let $D$ be the product of all their denominators and
define

    $\alpha'(\varphi) = D\alpha(\varphi)$ and $\theta' = D\theta$.

Then the $\alpha'(\varphi)$'s are all integers and clearly

    $\lceil \Sigma \alpha(\varphi)\varphi(X) > \theta \rceil = \lceil \Sigma \alpha'(\varphi)\varphi(X) > \theta' \rceil$

for all $X$. Now suppose that some members of $\{\alpha(\varphi)\}$ are
irrational. Then replace each $\alpha(\varphi)$ by some rational number $\alpha'(\varphi)$ in
the interval

    $\alpha(\varphi) < \alpha'(\varphi) < \alpha(\varphi) + \frac{\delta}{2^{2^{|R|}}}$

where $\delta$ is as defined above. This replacement cannot change the
sum $\Sigma \alpha(\varphi)\varphi(X)$ by as much as $\delta$, so it can never affect the value
of $\lceil \Sigma \alpha(\varphi)\varphi(X) > \theta \rceil$. For there are at most $2^{2^{|R|}}$ different $\varphi$'s.

1.3 The Concept of Order 

Predicates whose supports are very small are too local to be 
interesting in themselves. Our main interest is in predicates whose 
support is the whole of R, but which can be represented as linear 
threshold combinations of predicates with small support. A simple 
example is 

    $\psi(X) = \lceil X \text{ is not empty} \rceil.$

Clearly $S(\psi) = R$. On the other hand if we let $\Phi$ be the set of
predicates of the form $\varphi_p(X) = \lceil p \in X \rceil$ we have

    $|S(\varphi)| = 1$ for all $\varphi$ in $\Phi$,

    $\psi(X) = \lceil \Sigma \varphi_p(X) > 0 \rceil.$

These two statements allow us to say that the order of $\psi$ is 1. In
general, the order of $\psi$ is the smallest number $k$ for which we can
find a set $\Phi$ of predicates satisfying

    $|S(\varphi)| \le k$ for all $\varphi$ in $\Phi$,

    $\psi \in L(\Phi)$.

It should be carefully observed that the order of $\psi$ is a property of
$\psi$ alone, and not relative to any particular $\Phi$. This is what
makes it an important “absolute” concept. Those who know the
appropriate literature will recognize “order 1” as what are usually
called “linear threshold functions.”

1.4 Masks and other Examples of Linear Representation 

A very special role will be played by predicates of the form 

    $\varphi_A(X) = \lceil \text{all members of } A \text{ are in } X \rceil$

    $\qquad\quad = \lceil A \subset X \rceil.$

In the conventional Boolean point-function notations, these
predicates appear as conjunctions: if $A = \{y_1, \ldots, y_n\}$ then
$\varphi_A(X) = y_1 \wedge y_2 \wedge \ldots \wedge y_n$ or, as it is usually written,
$\varphi_A(X) = y_1 y_2 \ldots y_n$.

We shall call $\varphi_A$ the mask of the set $A$. In particular the constant
predicate $I(X)$ is the mask of the empty set; and the predicates
$\varphi_p$ in the previous paragraph are the masks of the one-point sets.






Proposition: All masks are of order 1.

proof: Let $A$ be any finite set. It contains $|A|$ points. For each
point $x \in A$ define $\varphi_x(X)$ to be $\lceil x \in X \rceil$. Then

    $\varphi_A(X) = \left\lceil \sum_{x \in A} \varphi_x(X) \ge |A| \right\rceil.$

Example 1: Of the 16 Boolean functions of two variables, all have
order 1 except for exclusive-or, $x \oplus y$, and its complement
identity, $x \equiv y$, which are of order 2:

    $x \oplus y = \lceil x\bar{y} + \bar{x}y > 0 \rceil,$
    $x \equiv y = \lceil xy + \bar{x}\bar{y} > 0 \rceil,$

where, for example, “$x\bar{y}$” is the predicate of support $= 2$ which is
true only when $x$ is in $X$ and $y$ is not in $X$. (Problem: prove that
$x \oplus y$ is not order 1!) Other examples from Boolean algebra are

    $x \supset y = \lceil \bar{x} \vee y \rceil = \lceil y - x > -1 \rceil,$
    $\sim x = \lceil -x > -1 \rceil.$



( stereoscopic) 

One can think of a linear inequality as defining a surface that separates 
points into two classes, thus defining a predicate. We do not recommend 
this kind of imagery until Part III. 


Any mask has order 1:

    $x \wedge y \wedge z = \lceil x + y + z > 2 \rceil$

as does any disjunction

    $x \vee y \vee z = \lceil x + y + z > 0 \rceil.$
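The same pattern works for a mask or a disjunction over any set $A$ of
points: count how many points of $A$ lie in $X$ and compare the count with
$|A| - 1$ or with 0. The check below is a sketch over a four-point
space; the point names and function names are ours.

```python
from itertools import combinations


def subsets(points):
    """Yield every subset of the given points as a frozenset."""
    points = list(points)
    for size in range(len(points) + 1):
        for combo in combinations(points, size):
            yield frozenset(combo)


def mask(A, X):
    """1 if every member of A is in X."""
    return int(A <= X)


def mask_as_threshold(A, X):
    """Order-1 form: sum of one-point predicates compared with |A| - 1."""
    return int(sum(int(x in X) for x in A) > len(A) - 1)


def disjunction_as_threshold(A, X):
    """Order-1 form of 'some member of A is in X'."""
    return int(sum(int(x in X) for x in A) > 0)


R = ("x", "y", "z", "w")
masks_agree = all(
    mask(A, X) == mask_as_threshold(A, X)
    for A in subsets(R) for X in subsets(R)
)
disjunctions_agree = all(
    int(any(x in X for x in A)) == disjunction_as_threshold(A, X)
    for A in subsets(R) for X in subsets(R)
)
```

The empty set is included in the check: its mask is the constant
predicate, and the threshold form gives $0 > -1$, which is always true.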

Example 2: $x_1 \equiv x_2$ can be represented as a linear combination of
masks by

    $x_1 \equiv x_2 = \lceil x_1 x_2 + (1 - x_1)(1 - x_2) > 0 \rceil$
    $\qquad\quad = \lceil 2 x_1 x_2 - x_1 - x_2 > -1 \rceil.$

A proof that “exclusive-or” and equivalence are not of order 1
will be found in §2.1.

Example 3: Let $M$ be an integer $0 < M < |R|$. Then the “counting
predicate” $\psi_M$, or $\lceil |X| = M \rceil$, which recognizes that $X$
contains exactly $M$ points, is of order 2.

proof: Consider the representation

    $\psi_M(X) = \left\lceil (2M - 1) \sum_{\text{all } i} x_i + (-2) \sum_{i<j} x_i x_j \ge M^2 \right\rceil.$

For any figure $X$ there will be $|X|$ terms $x_i$ with value 1, and
$\frac{1}{2}|X| \cdot (|X| - 1)$ terms $x_i x_j$ with value 1. Then the predicate is

    $\lceil (2M - 1) \cdot |X| - |X| \cdot (|X| - 1) - M^2 \ge 0 \rceil = \lceil (|X| - M)^2 \le 0 \rceil$

and the only value of $|X|$ for which this is true is $|X| = M$.
Observe that, by increasing the threshold we can modify the
predicate to accept counts within an arbitrary interval instead of a
single value.
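As a check on the algebra, the order-2 form can be evaluated for every
figure on a small retina. The sketch assumes the representation reads
$(2M - 1)\Sigma x_i - 2\Sigma_{i<j} x_i x_j \ge M^2$, which is the form that
reduces to $(|X| - M)^2 \le 0$; the function names are ours.

```python
from itertools import combinations


def subsets(points):
    """Yield every subset of the given points as a frozenset."""
    points = list(points)
    for size in range(len(points) + 1):
        for combo in combinations(points, size):
            yield frozenset(combo)


def counting_predicate(X, M):
    """Order-2 threshold form of [|X| = M], using masks of size 1 and 2."""
    singles = len(X)                             # masks x_i with value 1
    pairs = sum(1 for _ in combinations(X, 2))   # masks x_i x_j with value 1
    return int((2 * M - 1) * singles - 2 * pairs >= M * M)


R = range(6)
form_is_correct = all(
    counting_predicate(X, M) == int(len(X) == M)
    for X in subsets(R)
    for M in range(len(R) + 1)
)
```

Only masks on one and two points are counted, which is what places the
predicate at order no greater than 2.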



We have shown that the order is not greater than 2; Theorem 2.4
will confirm that it is not order 1. Note that the linear form for
the counting predicate does not contain $|R|$ explicitly. Hence it
works as well for an infinite space R.

Example 4: The predicates $\lceil |X| > M \rceil$ and $\lceil |X| < M \rceil$ are of
order 1 because they are represented by $\lceil \Sigma x_i > M \rceil$ and
$\lceil \Sigma x_i < M \rceil$.

1.5 The “Positive Normal Form Theorem” 

The order of a function can be determined by examining its
representation as a linear threshold function with respect to the set of
masks (Theorem 1.5.3). To prove this we first show

Theorem 1.5.1: Every $\psi$ is a linear threshold function with respect
to the set of all masks, that is, $\psi \in L(\text{all masks})$.

proof: Any Boolean function $\psi(x_1, \ldots, x_n)$ can be written in the
disjunctive normal form

    $C_1(X) \vee C_2(X) \vee \ldots \vee C_p(X),$

where each $C_i(X)$ has the form of a product (that is, a
conjunction)

    $z_1 z_2 \ldots z_n$

in which each $z_i$ is either an $x_i$ or an $\bar{x}_i$. Since at most one of the
$C_i(X)$ can be true for any $X$, we can rewrite $\psi$, using the
arithmetic sum

    $C_1(X) + C_2(X) + \ldots + C_p(X).$

Next, the following formula can be applied to any product
containing a barred letter: let $\mathcal{S}$ and $\mathcal{T}$ be any strings of letters.
Then

    $\mathcal{S}\,\bar{x}_j\,\mathcal{T} = \mathcal{S}(1 - x_j)\mathcal{T} = \mathcal{S}\mathcal{T} - \mathcal{S}x_j\mathcal{T}.$

If we continue to apply this, all the bars can be removed, without
ever increasing the length of a product.

When all the bars have been eliminated and like terms have been
collected together we have

    $\psi(X) = \Sigma \alpha_i \varphi_i(X),$        POSITIVE NORMAL FORM

where each $\varphi_i$ is a mask, and each $\alpha_i$ is an integer. Since
$\Sigma \alpha_i \varphi_i(X)$ is 0 or 1, this is the same as

    $\psi(X) = \lceil \Sigma \alpha_i \varphi_i(X) > 0 \rceil.$

Example: $\lceil x_1 + x_2 + x_3 \text{ is odd} \rceil = x_1 + x_2 + x_3 - 2x_1x_2 -
2x_2x_3 - 2x_3x_1 + 4x_1x_2x_3.$
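The coefficients of a positive normal form can also be computed
mechanically. The sketch below does not follow the bar-elimination
procedure of the proof; it uses inclusion-exclusion over subsets
(Möbius inversion), a standard technique that we substitute here, taking
$\alpha_A = \sum_{B \subset A} (-1)^{|A| - |B|}\,\psi(B)$. Function names are ours.

```python
from itertools import combinations


def subsets(points):
    """Yield every subset of the given points as a frozenset."""
    points = list(points)
    for size in range(len(points) + 1):
        for combo in combinations(points, size):
            yield frozenset(combo)


def positive_normal_form(predicate, R):
    """Map each mask A (a frozenset) to its nonzero integer coefficient."""
    coefficients = {}
    for A in subsets(R):
        alpha = sum(
            (-1) ** (len(A) - len(B)) * predicate(B) for B in subsets(A)
        )
        if alpha != 0:
            coefficients[A] = alpha
    return coefficients


def evaluate(coefficients, X):
    """Ordinary arithmetic sum of the coefficients of masks covered by X."""
    return sum(alpha for A, alpha in coefficients.items() if A <= X)


parity = lambda X: len(X) % 2
R = ("x1", "x2", "x3")
pnf = positive_normal_form(parity, R)
by_size = {len(A): alpha for A, alpha in pnf.items()}
reproduces_parity = all(evaluate(pnf, X) == parity(X) for X in subsets(R))
```

For three-point parity the coefficients depend only on the size of the
mask: 1 for single points, -2 for pairs, and 4 for the triple, in
agreement with the example above.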



Theorem 1.5.2: The positive normal form is unique. (Optional)

proof: To see this let $\{\varphi_i\}$ be a set of masks and $\{\gamma_i\}$ a set of numbers,
none equal to zero. Choose a $k$ for which $S(\varphi_k)$ is minimal, that is, there
is no $j \ne k$ such that $S(\varphi_j) \subset S(\varphi_k)$. Then

    $\varphi_k(S(\varphi_k)) = 1,$

    $\varphi_j(S(\varphi_k)) = 0, \quad j \ne k.$

It follows that $\Sigma \gamma_i \varphi_i(X)$ is not identically zero since it has the value
$\gamma_k$ for $X = S(\varphi_k)$.

Now if $\Sigma \alpha_i \varphi_i(X) = \Sigma \beta_i \varphi_i(X)$ for all $X$, then $\Sigma (\alpha_i - \beta_i)\varphi_i(X) = 0$
for all $X$. But

    $\sum_{\text{all } i} (\alpha_i - \beta_i)\varphi_i(X) = \sum_{\alpha_i \ne \beta_i} (\alpha_i - \beta_i)\varphi_i(X),$

and a sum of the latter kind, with all coefficients nonzero, cannot be
identically zero unless it has no terms. It follows that all $\alpha_i = \beta_i$.
This proves the uniqueness of the coefficients of the positive normal
form of $\psi$. Note that the positive normal form always has the values 0
and 1 as ordinary arithmetic sums; i.e., without requiring the $\lceil\ \rceil$
device of interpreting the validity of an inequality as a predicate.

Theorem 1.5.3: $\psi$ is of order $k$ if and only if $k$ is the smallest
number for which there exists a set $\Phi$ of masks satisfying

    $|S(\varphi)| \le k$ for all $\varphi$ in $\Phi$,

    $\psi \in L(\Phi)$.

proof: In $\psi = \lceil \Sigma \alpha_i \varphi_i > \theta \rceil$, each $\varphi_i$ can be replaced by its positive
normal form. If $|S(\varphi_i)| \le k$, this will be true also of all the
masks that appear in the positive normal form.

Example: A “Boolean form” has order no higher than the degree
in its disjunctive normal form. Thus

    $\Sigma \alpha_{ijk}\, x_i \bar{x}_j x_k = \Sigma \alpha_{ijk}\, x_i x_k - \Sigma \alpha_{ijk}\, x_i x_j x_k,$

illustrating how the negations are removed without raising the
order. This particular order-3 form appears later (§6.3) in a
perceptron that recognizes convex figures.

It is natural to wonder about the orders of predicates that are
Boolean functions of other predicates. Theorem 1.5.4 gives an
encouraging result:

Theorem 1.5.4: If $\psi_1$ has order $O_1$ and $\psi_2$ has order $O_2$, then
$\psi_1 \oplus \psi_2$ and $\psi_1 \equiv \psi_2$ have order $\le O_1 + O_2$.

proof: Let $\psi_1 = \lceil \Sigma \alpha_i \varphi_i > 0 \rceil$ and $\psi_2 = \lceil \Sigma \alpha'_j \varphi'_j > 0 \rceil$ and
assume that the coefficients are chosen so that the inner sums never
exactly equal zero. Then

    $\lceil \psi_1 \equiv \psi_2 \rceil = \lceil (\Sigma \alpha_i \varphi_i)(\Sigma \alpha'_j \varphi'_j) > 0 \rceil = \lceil \Sigma (\alpha_i \alpha'_j)(\varphi_i \varphi'_j) > 0 \rceil.$

But

    $|S(\varphi_i \varphi'_j)| \le |S(\varphi_i)| + |S(\varphi'_j)|.$

The other conclusion follows from $\lceil \psi_1 \oplus \psi_2 \rceil = 1 - \lceil \psi_1 \equiv \psi_2 \rceil$.

Example: Since

    $\psi_M(X) = \lceil \lceil M > |X| \rceil \equiv \lceil |X| > M \rceil \rceil$

we conclude that $\psi_M$ has order $\le 2$, which is another way to
obtain the result of §1.4, Example 3.

Question: What can be said about the orders of $\lceil \psi_1 \wedge \psi_2 \rceil$ and
$\lceil \psi_1 \vee \psi_2 \rceil$? The answer to this question may be surprising, in view
of the simple result of Theorem 1.5.4: It is shown in Chapter 4
that for any order $n$, there exists a pair of predicates $\psi_1$ and
$\psi_2$, both of order 1, for which $(\psi_1 \wedge \psi_2)$ and $(\psi_1 \vee \psi_2)$ have order
$> n$. In fact, suppose that $R = A \cup B \cup C$ where $A$, $B$, and $C$
are large disjoint subsets of R. Then $\psi_1 = \lceil |X \cap A| > |X \cap C| \rceil$
and $\psi_2 = \lceil |X \cap B| > |X \cap C| \rceil$ each have order 1
because they are represented by

    $\left\lceil \sum_{x_i \in A} x_i > \sum_{x_j \in C} x_j \right\rceil$

and

    $\left\lceil \sum_{x_i \in B} x_i > \sum_{x_j \in C} x_j \right\rceil$

but we shall see in Chapter 4 that $(\psi_1 \wedge \psi_2)$ and $(\psi_1 \vee \psi_2)$ are not
even of finite order in the sense about to be described in §1.6.


1.6 Predicates of Finite Order 

Strictly, a predicate is defined for a particular set R and it makes
no formal sense to speak of the same predicate for different R's.
But, as noted in §1.0, our real motivation is to learn more about
“predicates” defined independently of R (for example, concerning
the number of elements in the set $X$, or other geometric
properties of figures in a real Euclidean plane to which $X$ and R
provide mere approximations). To be more precise we could use a
phrase such as predicate scheme to refer to a general construction
which defines a predicate for each of a large class of sets R. This
would be too pedantic so (except in this section) we shall use
“predicate” in this wider sense.

Suppose we are given a predicate scheme $\psi$ which defines a
predicate $\psi_R$ for each of a family $\{R\}$ of retinas. We shall say that $\psi$ is
of finite order, in fact of order $\le k$, if the orders of the $\psi_R$ are
uniformly bounded by $k$ for all R's in the family. Two examples
will make this clearer:

1. Let $\{R_i\}$ be a sequence of sets with $|R_i| = i$. For each $R_i$ there
is a predicate $\psi_i$ defined by the predicate scheme $\psi_{PARITY}(X)$ which
asserts, for $X \subset R_i$, that “$|X|$ is an odd number.” As we will
see in §3.1, the order of any such $\psi_i$ must be $i$. Thus $\psi_{PARITY}$ is not
of finite order.

2. Now let $\psi_i$ be the predicate defined over $R_i$ by the predicate
scheme $\psi_{TEN}$:

    $\psi_{TEN}(X) = \lceil |X| = 10 \rceil.$

We have shown in §1.4 that $\psi_i$ is of order 2 for all $R_i$ with $i > 10$,
and it is (trivially) of order zero for $R_1, \ldots, R_9$. Thus the
predicate scheme $\psi_{TEN}$ is of finite order; in fact, it has order 2.

In these cases one could obtain the same dichotomy by considering
infinite sets R. On an infinite retina the predicate

    $\psi_{TEN}(X) = \lceil |X| = 10 \rceil$

is of finite order, in fact of order $= 2$, while

    $\psi_{PARITY}(X) = \lceil |X| \text{ is odd} \rceil$

has no order. We shall often look at problems in this way, for it is often
easier to think about one machine, even of infinite size, than about an
infinite family of finite machines. In Chapter 7 we will discuss
formalization of the concept of an infinite perceptron. It should be noted,
however, that the use of infinite perceptrons does not cover all cases. For
example, the predicate

    $\psi(X) = \lceil |X| > \tfrac{1}{2}|R| \rceil$

is well-defined and of order 1 for any finite R. It is meaningless for
infinite R, yet we might like to consider the corresponding predicate scheme
to have finite order.



Group Invariance of Boolean Inequalities 

2 


2.0 

In this chapter we consider linear threshold inequalities that are 
invariant under groups of transformations of the points of the 
base-space R. The purpose of this, finally achieved in Part II, is to 
establish a connection between the geometry of R and the realiz- 
ability of geometric predicates by perceptrons of finite order. 


2.1 Example: Coefficients Averaged Over a Symmetry 
As an introduction to the methods introduced in this section we
first consider a simple, almost trivial, example. Our space R has
just two points, $x$ and $y$. We will prove that the predicate
$\psi_{\equiv}(x, y) = xy \vee \bar{x}\bar{y}$ is not of order 1. (This predicate
asserts that $X$ is all black or all white.) One way to prove this is to
deduce a contradiction from the hypothesis that numbers $\alpha$, $\beta$,
and $\theta$ can be found for which

    $\psi_{\equiv}(x, y) = xy \vee \bar{x}\bar{y} = \lceil \alpha x + \beta y > \theta \rceil.$

We can proceed directly by writing down the conditions on $\alpha$
and $\beta$:

    $\psi_{\equiv}(1, 0) = 0 \Rightarrow \alpha \le \theta$
    $\psi_{\equiv}(0, 1) = 0 \Rightarrow \beta \le \theta$
    $\psi_{\equiv}(1, 1) = 1 \Rightarrow \alpha + \beta > \theta$
    $\psi_{\equiv}(0, 0) = 1 \Rightarrow 0 > \theta$

In this simple case it is easy enough to deduce the contradiction,
for adding the first two conditions gives us

    $\alpha + \beta \le 2\theta,$

and this, with the third implies that

    $\theta < 2\theta,$

and this would make $\theta$ positive, contradicting the fourth
condition.

But arguments of this sort are hard to generalize to more
complex situations involving many variables. On the other hand the
following argument, though it may be considered a little more
complicated in itself, leads directly to much deeper insights. First
we observe that the value of $\psi_{\equiv}$ is unchanged when we permute,
that is, interchange, $x$ and $y$. That is,

    $\psi_{\equiv}(x, y) = \psi_{\equiv}(y, x).$

Thus if one of the following holds, so must the other:

    $\alpha x + \beta y > \theta$
    $\alpha y + \beta x > \theta;$

hence

    $\tfrac{1}{2}(\alpha + \beta)x + \tfrac{1}{2}(\alpha + \beta)y > \theta$

by adding the inequalities. Similarly, either of

    $\alpha x + \beta y \le \theta$
    $\alpha y + \beta x \le \theta$

yields

    $\tfrac{1}{2}(\alpha + \beta)x + \tfrac{1}{2}(\alpha + \beta)y \le \theta.$

It follows that if we write $\gamma$ for $\tfrac{1}{2}(\alpha + \beta)$, then

    $\psi_{\equiv}(x, y) = \lceil \gamma x + \gamma y > \theta \rceil = \lceil \gamma(x + y) > \theta \rceil.$

Thus we can construct a new linear representation for $\psi_{\equiv}$ in which
the coefficients of $x$ and $y$ are equal.

It follows that

    $\psi_{\equiv}(X) = \lceil \gamma|X| > \theta \rceil,$

where $|X|$ is, as usual, the number of points in $X$.

Now consider the three figures $X_0 = \{\}$, $X_1 = \{x\}$, $X_2 = \{x, y\}$:

    $|X_0| = 0$ and $\gamma \cdot 0 > \theta,$

    $|X_1| = 1$ and $\gamma \cdot 1 \le \theta,$

    $|X_2| = 2$ and $\gamma \cdot 2 > \theta.$

This is clearly impossible. Thus we learn something about $\psi$ by
“averaging” its coefficients after making a permutation that leaves
the predicate unchanged. (In the example $\gamma$ is the average of $\alpha$ and
$\beta$.) In §2.3 we shall define precisely the notion of “average” that
is involved here.
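The averaging step can be tried out on a grid of candidate
coefficients. The sketch below, with names and grid of our own choosing,
checks two things: whenever a two-point threshold predicate happens to
be unchanged by interchanging $x$ and $y$, replacing both coefficients by
their average leaves it unchanged; and no coefficients on the grid
realize the predicate “both or neither.” The grid search only
illustrates the second fact; the argument in the text is what proves it.

```python
from itertools import product


def truth_table(alpha, beta, theta):
    """Values of [alpha*x + beta*y > theta] at (0,0), (0,1), (1,0), (1,1)."""
    return tuple(
        int(alpha * x + beta * y > theta)
        for x, y in product((0, 1), repeat=2)
    )


def is_symmetric(table):
    """True if interchanging x and y leaves the predicate unchanged."""
    return table[1] == table[2]


def averaged(alpha, beta):
    """Replace both coefficients by their average gamma."""
    gamma = (alpha + beta) / 2
    return gamma, gamma


grid = [i / 2 for i in range(-6, 7)]   # assumed sample of coefficients
averaging_preserves = all(
    truth_table(*averaged(a, b), t) == truth_table(a, b, t)
    for a, b, t in product(grid, repeat=3)
    if is_symmetric(truth_table(a, b, t))
)
EQUIVALENCE = (1, 0, 0, 1)             # true exactly at (0,0) and (1,1)
equivalence_found = any(
    truth_table(a, b, t) == EQUIVALENCE
    for a, b, t in product(grid, repeat=3)
)
```

After averaging, the predicate depends only on $x + y$, that is, on $|X|$,
which is the reduction the argument above exploits.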



[Figure caption, beginning missing:] ... the shaded regions, but this
requires a polynomial of second or higher degree.

2.1.1* Groups and Equivalence-classes of Predicates 

The generalization of the procedure of §2.1 involves introducing 
an arbitrary group of transformations on the base space R, and 
then asking what it means for a predicate \p to be unaffected by 
any of these transformations (just as the predicate of §2.1 was un- 
affected by transposing two points). It is through this idea of 
“invariance under a group of transformations” that we will be 
able to attack geometrical problems; in so doing we are adopting 
the mathematical viewpoint of Felix Klein: every interesting geo- 
metrical property is an invariant of some transformation group. 

A good example of a group of transformations is the set of all
translations of the plane: a translation is a transformation that
moves every point of the plane into another point, with every
point moved the same amount in the same direction; that is, a
rigid parallel shift. Figure 2.2 illustrates the effect of two
translations, $g_1$ and $g_2$, on a figure $X$. The picture illustrates a
number of definitions and observations we want to make.

*This section can be skipped by readers who know the basic definitions of the
theory of groups.

1. We define a translation to operate upon individual points, so
that $g_1$ operating on the point $x$ yields another point $g_1 x$. This
“induces” a natural sense in which the $g$'s operate on whole
figures; let us define it. If $g$ is one of a group $G$ of transformations
(abbreviated “$g \in G$”) and if $X$ is a figure, that is, a subset
of R, we define

    $gX = \{gx \mid x \in X\}$

which is read: $gX$ is (defined to be) the set of all points $gx$
obtained by applying $g$ to points $x$ of $X$.

2. If we apply to $X$ first one transformation $g_1$ and then another
transformation $g_2$ we obtain a new figure that could be denoted
by “$g_2(g_1 X)$.” But that same figure could be regarded as obtained
by a single transformation (the “composition” of $g_2$ and $g_1$) and
it is customary to denote this composite by “$g_2 g_1$” and hence the
new figure by “$g_2 g_1 X$” as shown in the figure. The mathematical
definition of group requires that if $g_1 \in G$ and $g_2 \in G$ then their
composition $g_2 g_1$ must also be in $G$.

Incidentally, in the case of the plane-translations, it is always true
that $g_1 g_2 X = g_2 g_1 X$, as can be seen by completing the
parallelogram of $X$, $g_1 X$, $g_2 X$, and $g_2 g_1 X$. This is to be regarded as a
coincidence, because it is not always true in other important geometric
groups. For example, if $G$ is the group generated by all rotations
about all points in the plane, then for the indicated $g_1$ and $g_2$ shown
below, the points $g_1 g_2 x$ and $g_2 g_1 x$ are quite different.

Figure 2.3 Here $g_1$ is a small rotation about $p_1$, and $g_2$ is a 90° rotation
about $p_2$. The figure shows why, for the rotation group, we usually find
that $g_1 g_2 x \ne g_2 g_1 x$.


3. The final requirement of the technical definition of “group of
transformations” is that for each $g \in G$ there must be an inverse
transformation called $g^{-1}$, also in $G$, with the property that
$g^{-1}gx = x$ for every point $x$. In Figure 2.2 we have indicated the
inverses of the translations $g_1$ and $g_2$. One can construct the
inverse of $g_2 g_1$ by completing the parallelogram to the left. In fact a
little thought will show that (in any group!) it is always true that
$(g_2 g_1)^{-1} = g_1^{-1} g_2^{-1}$.

It is always understood that a group contains the trivial
“identity” transformation $e$, which has the property that for all $x$,
$ex = x$. In fact, since $e$ is the composition $g^{-1}g$ of any $g$ and its
inverse $g^{-1}$, the presence of $e$ in $G$ already logically follows from
the requirements of 2 and 3. It is easy to see also that $gg^{-1} = e$
always.

In algebra books, one finds additional requirements on groups,
for example, that

    $(g_1 g_2) g_3 = g_1 (g_2 g_3)$

for all $g_1$, $g_2$, and $g_3$. For the groups of transformations we always
use here, this goes without saying, because it is already implicit in
the intuitive notion of transformation. The associative law above
is seen to hold simply by following what happens to each separate
point of R.

4. If $h$ is a particular member of $G$ then the set $hG$ defined by

    $hG = \{hg \mid g \in G\}$

(that is, the set obtained by composing $h$ with every member of
$G$) is the whole group $G$ and each element is obtained just once.
To see this, note first that any element $g$ is obtained:

    $h(h^{-1}g) = (hh^{-1})g = eg = g$

and $h^{-1}g$ must be in the group. If, say, $g_0$ happens twice,

    $g_0 = hg_1,$
    $g_0 = hg_2,$

we would have both of

    $h^{-1}g_0 = h^{-1}hg_1 = g_1$
    $h^{-1}g_0 = h^{-1}hg_2 = g_2$

so that $g_1$ and $g_2$ could not be different.
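Observation 4 can be confirmed on a small example. The sketch below
takes $G$ to be the group of all six rearrangements of a three-point
space (our choice of example) and represents each transformation as a
tuple giving the image of each point; the names are ours.

```python
from itertools import permutations


def compose(g2, g1):
    """Return the transformation 'first g1, then g2' on points 0..n-1."""
    return tuple(g2[g1[x]] for x in range(len(g1)))


G = list(permutations(range(3)))   # all rearrangements of three points


def left_translate(h, group):
    """The list hG: h composed with every member of the group, in order."""
    return [compose(h, g) for g in group]


# Comparing sorted lists checks both that every element of G appears in hG
# and that none appears twice.
each_once = all(sorted(left_translate(h, G)) == sorted(G) for h in G)

# Unlike plane translations, these transformations need not commute.
commutative = all(compose(a, b) == compose(b, a) for a in G for b in G)
```

The second check echoes the earlier remark that commutativity of
translations is a coincidence rather than a property of groups in
general.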

5. In most of what follows, and particularly in §2.3, we want to 
work with groups G that contain only a finite number of trans- 
formations. But still, we want to capture the spirit of the ordinary 
Euclidean transformation groups, which are infinite. There are an 
infinite number of different distances a figure can be translated in 
the plane: for example, if $g \neq e$ is any nontrivial translation then
g, gg, ggg, . . . are all different. In most cases we will be able to
prove the theorems we want by substituting a finite group for the 
infinite one, if necessary by altering the space R itself! For ex- 
ample, in dealing with translation we will often use, instead of the 
Euclidean plane, a torus, as in Figure 2.4. 

Figure 2.4

The torus is ruled off in squares, as shown. As our substitute
for the infinite set of plane-translations, we consider just those
torus-transformations that move each point a certain number m
of square units around the large equator, and a certain number
n of units around the small equator. There are just a finite number
of such “translations.” The torus behaves very much like a small
part of the plane for most practical purposes, because it can be
“cut” and unwrapped (Figure 2.5). Hence for “small” figures and
“small” translations there is no important difference between the
torus and the plane. This will be discussed further in the
introduction to Part II, and in Chapter 7.
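
The finite translation group of the torus can be sketched in a few lines. The grid size (6 by 4) and the function names below are assumptions made for illustration only; the point is that repeated powers of a translation eventually return to the identity, unlike in the plane.

```python
from itertools import product

W, H = 6, 4   # assumed grid: W squares around the large equator, H around the small

def translate(point, g):
    """Move a point m squares one way and n squares the other, wrapping around."""
    (x, y), (m, n) = point, g
    return ((x + m) % W, (y + n) % H)

def compose(g, h):
    """Composition of two torus translations is again a torus translation."""
    return ((g[0] + h[0]) % W, (g[1] + h[1]) % H)

translations = list(product(range(W), range(H)))   # the whole finite group
identity = (0, 0)

# Repeated powers g, gg, ggg, ... of a nontrivial translation.
g = (1, 0)
powers, current = [], identity
for _ in range(W):
    current = compose(current, g)
    powers.append(current)
```

Here `powers` contains W distinct translations and ends at the identity, so the group generated by g is finite while still behaving like plane translation for small shifts.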








2.2 Equivalence-classes of Figures and of Predicates 

Given a group G, we will say that two figures X and Y are
G-equivalent (and we write $X \equiv_G Y$) if there is a member g of G
for which $X = gY$. Notice that

$X \equiv_G X$, because $X = eX$;

$X \equiv_G Y$ implies $Y \equiv_G X$, because if $X = gY$ then $Y = g^{-1}X$;

$X \equiv_G Y$ and $Y \equiv_G Z$ imply $X \equiv_G Z$, because if $X = gY$ and $Y = hZ$
then $X = ghZ$.

When we choose a group, we thus automatically also set up a 
classification of figures into equivalence-classes. This is important 
later, because it will turn out that the “patterns”— or sets of 
figures — we want to recognize always fall into such classifications 
when we choose the right groups. 

Example: Suppose that G is the set of all permutations of a finite 
set of points R. (A permutation is any rearrangement of the points 
in which no two points are brought together.) Then (theorem!) 
two figures X and Y are G-equivalent if and only if they both
contain the same number of points. 
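
The “theorem” in this example is easy to confirm by brute force on a small space. The sketch below assumes a four-point R and represents figures as frozensets; the names `apply`, `equivalent`, and `classes` are illustrative only.

```python
from itertools import combinations, permutations

R = range(4)                      # assumed four-point space
G = list(permutations(R))         # all permutations of R; g[p] is the image of p
figures = [frozenset(c) for k in range(len(R) + 1) for c in combinations(R, k)]

def apply(g, X):
    """The figure gX obtained by moving every point of X with g."""
    return frozenset(g[p] for p in X)

def equivalent(X, Y):
    """X is G-equivalent to Y if X = gY for some g in G."""
    return any(X == apply(g, Y) for g in G)

# Sort the figures into equivalence classes.
classes = []
for X in figures:
    for cls in classes:
        if equivalent(X, cls[0]):
            cls.append(X)
            break
    else:
        classes.append([X])
```

With the full permutation group there is one class for each possible number of points, so a four-point space yields five classes.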

Example: If one wanted to build a machine to read printed letters 
or numbers, he would normally want it to be able to recognize 
them whatever their position on the page: 



That is to say that this machine’s decision should be unaffected 
by members of the translation group. A more sophisticated way 
to say the same thing is to state that the machine’s perception 
should be “translation-invariant,” that is, it must make the same 
decision on every member of a translation-equivalence class.* 


*In practice, of course, one wants more from the machine: one wants to know 
not only what is on a page, but where it is. Otherwise, instead of “reading” what is 
on the page, the machine would present us with anagrams! 






In §2.3 we prove an important theorem that tells us a great deal
about any perceptron whose behavior is “G-invariant” for some
group G, that is, one whose predicate $\psi(X)$ depends only upon the
equivalence-class of X. In order to state the theorem, we will have
to define what we mean by G-equivalence of predicates.

We will say that two predicates $\varphi$ and $\varphi'$ are equivalent with
respect to a group G,

$$\varphi \equiv_G \varphi',$$

if there is a member g of G such that $\varphi(gX)$ and $\varphi'(X)$ are the
same for every X.

It is easy to see that this really is an equivalence relation, that is,

$\varphi \equiv_G \varphi$ for any $\varphi$,

$\varphi \equiv_G \varphi'$ implies $\varphi' \equiv_G \varphi$,

$\varphi \equiv_G \varphi'$ and $\varphi' \equiv_G \varphi''$ imply $\varphi \equiv_G \varphi''$.

Given any predicate $\varphi$ and group element g, we will define $\varphi g$
to be the predicate that, for each X, has the value $\varphi(gX)$. Thus we
always have $\varphi g(X) = \varphi(gX)$. We will say $\Phi$ is closed under G if
for every $\varphi$ in $\Phi$ and g in G the predicate $\varphi g$ is also in $\Phi$.
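
For masks these definitions become very concrete. The sketch below assumes a four-point space with the cyclic rotation group and treats a mask as the predicate that is 1 exactly when its support lies inside X; all function names are invented for illustration. It checks that $\varphi g$ is again a mask, with support $g^{-1}S(\varphi)$, and computes the G-closure of a set of masks.

```python
from itertools import combinations

R = range(4)
# Assumed group: the four cyclic rotations of a four-point ring.
G = [tuple((p + s) % 4 for p in R) for s in range(4)]

def apply(g, X):
    """The figure gX."""
    return frozenset(g[p] for p in X)

def inverse(g):
    """The inverse permutation of g."""
    inv = [0] * len(g)
    for p, q in enumerate(g):
        inv[q] = p
    return tuple(inv)

def mask(A):
    """The mask with support A: value 1 exactly when A is contained in X."""
    A = frozenset(A)
    return lambda X: int(A <= frozenset(X))

def act(phi, g):
    """The predicate (phi g), defined by (phi g)(X) = phi(gX)."""
    return lambda X: phi(apply(g, X))

def closure(supports):
    """Supports of all masks (phi g) obtained from the given masks."""
    return {apply(inverse(g), A) for A in supports for g in G}

figures = [frozenset(c) for k in range(5) for c in combinations(R, k)]
A = frozenset({0, 1})
for g in G:
    shifted = mask(apply(inverse(g), A))
    assert all(act(mask(A), g)(X) == shifted(X) for X in figures)
```

A set of masks is closed under G exactly when `closure` returns the set of supports it was given.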



Three $\varphi$'s Equivalent under a Rotation Group





Now at last we can state and prove our main theorem. It will
show that if a perceptron predicate is invariant under a group G,
then its coefficients need depend only on the G-equivalence classes
of their $\varphi$'s. This theorem will be our single most powerful tool
in all that follows, for it is the generalization of our method of
§2.1 and will let us convert complicated problems of geometry
into (usually) simple problems of algebra.

2.3 The Group-Invariance Theorem 

Suppose that 

(1) G is a finite group of transformations of a finite space R;

(2) $\Phi$ is a set of predicates on R closed under G;

(3) $\psi$ is in $L(\Phi)$ and invariant under G.

Then there exists a linear representation of $\psi$,

$$\psi = \Big\lceil \sum_{\varphi \in \Phi} \beta_\varphi \varphi > 0 \Big\rceil,$$

for which the coefficients $\beta_\varphi$ depend only on the G-equivalence
class of $\varphi$, that is,

if $\varphi \equiv_G \varphi'$ then $\beta_\varphi = \beta_{\varphi'}$.

These conditions are stronger than they need be. To be sure, the theorem 
is not true in general for infinite groups. A counterexample will be found 
in §7.10. However, in special cases we can prove the theorem for infinite 
groups. An example with interesting consequences will be discussed later, 
in §10.4. It will also be seen that the assumption that G be a group can 
be relaxed slightly. 

We have not investigated relaxing condition (2), and this would be
interesting. However, it does not interfere with our methods for showing
certain predicates to be not of finite order. For when the theorem is
applied to show that a particular $\psi$ is not in $L(\Phi)$ for a particular $\Phi$, it
is done by showing that $\psi$ is not linear even in the G-closure of $\Phi$.
Remember that the order of a predicate (§1.3) is defined without
reference to any particular set $\Phi$ of $\varphi$'s! And closing a $\Phi$ under a group G
cannot change the maximum support size of the predicates in the set.

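proof: Let $\psi(X)$ have a linear representation
$\lceil \sum_{\varphi \in \Phi} \alpha(\varphi)\varphi(X) > 0 \rceil$.
We use “$\alpha(\varphi)$” instead of $\alpha_\varphi$ to avoid complicated subscripts.
Any element g of G defines a one-to-one correspondence $\varphi \leftrightarrow \varphi g$,
that is, a permutation of the $\varphi$'s. Therefore

$$\sum_{\varphi \in \Phi} \alpha(\varphi)\varphi(X) = \sum_{\varphi \in \Phi} \alpha(\varphi g)\,\varphi g(X)$$

for all X, simply because the same numbers are added in both
sums. Now, choose an X for which $\psi(X) = 1$. Since $\psi$ is G-invariant,
and $g^{-1}$ is an element of G, we must have

$$\sum_{\varphi \in \Phi} \alpha(\varphi g)\,\varphi g(g^{-1}X) > 0,$$

and since $\varphi g(g^{-1}X) = \varphi(gg^{-1}X) = \varphi(X)$, we conclude that for any
g in G, if $\psi(X) = 1$,

$$\sum_{\varphi \in \Phi} \alpha(\varphi g)\varphi(X) > 0.$$

Summing these positive quantities for all g's in G, we see that

$$\sum_{g \in G} \sum_{\varphi \in \Phi} \alpha(\varphi g)\varphi(X) > 0.$$

If we collect together the coefficients for each $\varphi$, we then obtain

$$\sum_{\varphi \in \Phi} \Big( \sum_{g \in G} \alpha(\varphi g) \Big) \varphi(X) > 0,$$

which is an expression in $L(\Phi)$, that is, can be written as

$$\sum_{\varphi \in \Phi} \beta(\varphi)\varphi(X) > 0.$$

Remember that this depends on assuming that $\psi(X) = 1$. Now
choose an X for which $\psi(X) = 0$. Then the same argument will
show that

$$\sum_{\varphi \in \Phi} \beta(\varphi)\varphi(X) \le 0.$$

Combining the inequalities for $\psi = 1$ and $\psi = 0$, we conclude
that

$$\psi(X) = \Big\lceil \sum_{\varphi \in \Phi} \beta(\varphi)\varphi(X) > 0 \Big\rceil.$$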





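It remains only to show, as promised, that

$$\varphi \equiv_G \varphi' \;\Rightarrow\; \beta(\varphi) = \beta(\varphi').$$

But $\varphi \equiv_G \varphi'$ means that there is some h such that $\varphi = \varphi' h$, so

$$\beta(\varphi) = \sum_{g \in G} \alpha(\varphi g) = \sum_{g \in G} \alpha(\varphi' h g) = \sum_{g \in G} \alpha(\varphi' g) = \beta(\varphi')$$

because the one-to-one correspondence $g \leftrightarrow hg$ simply permutes
the order of adding the same numbers.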
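
The construction $\beta(\varphi) = \sum_{g \in G} \alpha(\varphi g)$ used in the proof can be tried out on a toy case. The sketch below is an illustration under assumptions, not part of the original argument: R has four points, G is the cyclic rotation group, $\Phi$ consists of the constant mask and the four one-point masks, and the lopsided integer coefficients `alpha` were chosen by hand so that they represent the invariant predicate $\lceil |X| \ge 2 \rceil$.

```python
from itertools import combinations

R = range(4)
G = [tuple((p + s) % 4 for p in R) for s in range(4)]   # assumed: cyclic rotations

def apply(g, X):
    return frozenset(g[p] for p in X)

def inverse(g):
    inv = [0] * len(g)
    for p, q in enumerate(g):
        inv[q] = p
    return tuple(inv)

# Masks are keyed by their supports. These hand-picked, unequal coefficients
# represent psi(X) = [|X| >= 2] (an assumed example predicate).
alpha = {frozenset(): -25,
         frozenset({0}): 20, frozenset({1}): 24,
         frozenset({2}): 18, frozenset({3}): 22}

def psi(coeffs, X):
    """Threshold predicate: sum the coefficients of all masks satisfied by X."""
    return int(sum(c for A, c in coeffs.items() if A <= X) > 0)

# beta(phi) = sum over g of alpha(phi g); for a mask with support A,
# the predicate (phi g) is the mask with support g^{-1} A.
beta = {A: sum(alpha[apply(inverse(g), A)] for g in G) for A in alpha}

figures = [frozenset(c) for k in range(5) for c in combinations(R, k)]
assert all(psi(alpha, X) == psi(beta, X) for X in figures)
```

After summing over the group, all one-point masks carry the same coefficient, while the predicate computed is unchanged.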





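second proof: Because of the importance of the theorem, we
give another version of the proof, which may seem more intuitive
to some readers.

Choose an X for which $\psi(X) = 1$. Then for any $g \in G$ we will
have $\psi(gX) = 1$, hence each of the sums

$$\sum_{\varphi \in \Phi} \alpha(\varphi)\varphi(gX)$$

will be positive, and so will their sum

$$\sum_{g \in G} \sum_{\varphi \in \Phi} \alpha(\varphi)\varphi(gX) = \sum_{g \in G} \sum_{\varphi \in \Phi} \alpha(\varphi)\,\varphi g(X).$$

We can think of all the terms of this sum as arranged in a great
$|\Phi| \times |G|$ array

$$\begin{aligned}
\big[\ & \alpha(\varphi_1)\varphi_1 g_1 + \alpha(\varphi_2)\varphi_2 g_1 + \cdots + \alpha(\varphi_{|\Phi|})\varphi_{|\Phi|} g_1 \\
+\ & \alpha(\varphi_1)\varphi_1 g_2 + \alpha(\varphi_2)\varphi_2 g_2 + \cdots + \alpha(\varphi_{|\Phi|})\varphi_{|\Phi|} g_2 \\
& \qquad \vdots \\
+\ & \alpha(\varphi_1)\varphi_1 g_{|G|} + \cdots + \alpha(\varphi_{|\Phi|})\varphi_{|\Phi|} g_{|G|}\ \big]\,(X).
\end{aligned}$$

We want to write this in the form $\beta_1\varphi_1 + \beta_2\varphi_2 + \cdots$ so we have
to collect the coefficients of each $\varphi_i$. To do so, we have to collect
together for each $\varphi_i$ those terms

$\alpha(\varphi_j)$ for which $\varphi_j g_k = \varphi_i$.

The sum of those terms is, of course, $\beta_i$. Our real purpose, however,
is not to calculate $\beta_i$ but to show that

$$\varphi_a \equiv_G \varphi_b \;\Rightarrow\; \beta_a = \beta_b.$$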


To do this, suppose that in fact $\varphi_a \equiv_G \varphi_b$, which implies that we
can find an element g for which

$$\varphi_a = \varphi_b g.$$

We will use this to establish a one-to-one correspondence between
the two sets of elements of the array that add up to form $\beta_a$ and
$\beta_b$. Define

“the $g_j$-entry of $\varphi_k$”

to be $\alpha(\varphi_i)\varphi_i g_j$ where i is determined by $\varphi_i g_j = \varphi_k$. Then in
the array there is exactly one “$g_j$-entry of $\varphi_k$” for each j and k.
(It is irrelevant that there may be many different elements h in G
that satisfy $\varphi_i h = \varphi_k$. We are concerned here only with each
entry’s occurrence in the array, not with its value.)

Now, if $\alpha(\varphi_i)\varphi_i g_j$ is the $g_j$-entry of $\varphi_b$, then

$$\varphi_i g_j = \varphi_b,$$

and therefore

$$\varphi_i g_j g = \varphi_b g = \varphi_a;$$

hence $\alpha(\varphi_i)\varphi_i g_j g$ is the $g_j g$-entry of $\varphi_a$.

If we recall that

$$g_j \leftrightarrow g_j g$$

is a one-to-one correspondence within the group elements, as
shown in observation 4 of §2.1.1 (see Figure 2.6), we conclude
that the corresponding elements in the $\beta_a$ and $\beta_b$ sums must have
the same coefficients, so the sums $\beta_a$ and $\beta_b$ must be equal.



Figure 2.6 



Since the same argument holds for the case of $\psi(X) = 0$, the
theorem is proved. Extensions of this to certain infinite spaces are 
discussed in Chapters 7 and 10. 

For readers who find these ideas difficult to work with abstractly, 
some concrete examples of the equivalence classes will be useful; 
the geometric “spectra” of §5.2 and especially of §6.2 should be 
helpful. 


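We shall often use this theorem in the following form:

Corollary 1: Any group-invariant predicate $\psi$ (of order k) that
satisfies the conditions of the theorem has a representation

$$\psi = \Big\lceil \sum_{\varphi \in \Phi_k} \alpha_\varphi \varphi > 0 \Big\rceil$$

where $\Phi_k$ is the set of masks (of degree $\le k$) and $\alpha_\varphi = \alpha_{\varphi'}$ if
$S(\varphi)$ can be transformed into $S(\varphi')$ by an element of G.

proof: For masks, $\varphi_A \equiv_G \varphi_B$ if and only if $A = gB$ for some $g \in G$.

Corollary 2: Let $\Phi = \Phi_1 \cup \cdots \cup \Phi_m$ be the decomposition of $\Phi$
into equivalence classes by the relation $\equiv_G$. Then, under the conditions
of the theorem, $\psi$ can be written in the form

$$\psi = \Big\lceil \sum_i \alpha_i N_i(X) > 0 \Big\rceil$$

where $N_i(X) = |\{\varphi \mid \varphi \in \Phi_i \text{ and } \varphi(X)\}|$, that is, the number of
$\varphi$'s of the ith type, equivalent under the group, satisfied by the
figure X.

proof: $\psi$ can be represented as

$$\Big\lceil \sum_{\varphi \in \Phi} \alpha_\varphi \varphi > 0 \Big\rceil, \text{ that is, }
\Big\lceil \sum_i \sum_{\varphi \in \Phi_i} \alpha_\varphi \varphi > 0 \Big\rceil, \text{ that is, }
\Big\lceil \sum_i \alpha_i \sum_{\varphi \in \Phi_i} \varphi > 0 \Big\rceil, \text{ that is, }
\Big\lceil \sum_i \alpha_i N_i(X) > 0 \Big\rceil.$$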


2.4 The Triviality of Invariant Predicates of Order 1: 

First application of the group-invariance theorem. 

Theorem 2.4: Let G be any group of permutations on R that has
the property:* for every pair of points (p, q) of R there is at least
one g in G such that $gp = q$. Then the only first-order predicates
invariant under G are

$$\psi(X) = \lceil |X| > m \rceil,$$
$$\psi(X) = \lceil |X| \ge m \rceil,$$
$$\psi(X) = \lceil |X| < m \rceil,$$
$$\psi(X) = \lceil |X| \le m \rceil,$$

for some m.


*This property, shared by most of the interesting geometric groups, is usually 
called “transitivity.” Pure rotations about a fixed center constitute an exception, 
as does the group of all translations parallel to a fixed direction in the plane. 
But the groups of all rotations about all centers, or all translations, etc., are 
transitive. 





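proof: Since all the one-point predicates $\varphi_p$ are equivalent, we
can assume that

$$\psi(X) = \Big\lceil \sum_{p \in R} \alpha \varphi_p(X) > \theta \Big\rceil,$$

that is, the coefficients are independent of p. But $\sum \alpha \varphi_p > \theta$ is the
same as $\sum \varphi_p > \theta/\alpha$ for $\alpha > 0$. (For negative $\alpha$ we have to use
“<” instead.) And

$$\sum_{p \in R} \varphi_p(X) = |X|.$$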
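
Theorem 2.4 can be spot-checked by exhaustive search over a small family of order-1 predicates. The sketch below assumes a four-point R with the (transitive) cyclic rotation group, and it searches only integer weights from -2 to 2 with half-integer thresholds; these ranges are arbitrary choices for illustration. Every invariant predicate found turns out to depend on $|X|$ alone.

```python
from itertools import combinations, product

R = range(4)
G = [tuple((p + s) % 4 for p in R) for s in range(4)]   # assumed transitive group
figures = [frozenset(c) for k in range(5) for c in combinations(R, k)]

def apply(g, X):
    return frozenset(g[p] for p in X)

def predicate(w, theta):
    """Order-1 predicate [ sum of w[p] over p in X > theta ]."""
    return lambda X: sum(w[p] for p in X) > theta

def is_invariant(w, theta):
    pred = predicate(w, theta)
    return all(pred(X) == pred(apply(g, X)) for X in figures for g in G)

def depends_only_on_size(w, theta):
    pred, seen = predicate(w, theta), {}
    return all(seen.setdefault(len(X), pred(X)) == pred(X) for X in figures)

weights = list(product(range(-2, 3), repeat=4))   # assumed search range
thetas = [t + 0.5 for t in range(-5, 5)]          # assumed thresholds
invariant = [(w, t) for w in weights for t in thetas if is_invariant(w, t)]

assert all(depends_only_on_size(w, t) for w, t in invariant)
```

Unequal weights can still yield an invariant predicate (for instance a constant one), but never one that distinguishes two figures of the same size.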

Thus order-1 predicates invariant under the usual geometric
groups can do nothing more than define simple “> m”-type
inequalities on the size or “area” of the figures. In particular, taking
the translation group G we see that no first-order perceptron can
distinguish the A’s in the figure on p. 46 (§2.2) from some other
translation-invariant set of figures of the same area.

2.4.1 Noninvariant Predicates of Order 1 

If one gives up geometric group invariance there are still some
simple but useful predicates of order 1, for we can represent
inequalities related to the ordinary integrals. For example, the
following predicates of plane figures can be realized: let $x_p$ and
$y_p$ be the x and y coordinates of the point p:

Figure 2.7

$\lceil X$ has more area in the right half-plane than in the left$\rceil$

$$= \Big\lceil \sum_{\text{right half}} \varphi_p - \sum_{\text{left half}} \varphi_p > 0 \Big\rceil.$$

$\lceil$The center of gravity of X is right of center$\rceil$

$$= \Big\lceil \sum x_p \varphi_p > 0 \Big\rceil \quad \text{(see Figure 0.3, p. 11)},$$

$\lceil$The nth central moment of X about the origin is greater than $\theta\rceil$

$$= \Big\lceil \sum \varphi_p \big(\sqrt{x_p^2 + y_p^2}\,\big)^n > \theta \Big\rceil,$$

and so on. But these “moment-type” predicates are restricted to
having their reference coordinates in the absolute plane, and not
in the figure X. For example one cannot realize, with order 1,

$\lceil$The second moment of X about
its own center of gravity is greater than $\theta\rceil$

because that predicate is invariant under the (transitive)
translation group.
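
Each of these predicates is a weighted count of the points of X compared with a threshold. A minimal sketch, assuming a retina whose points carry integer coordinates centered on the origin; the helper `linear_threshold` and the other names are hypothetical:

```python
def linear_threshold(weight, X, theta=0):
    """Order-1 predicate: [ sum over p in X of weight(p) > theta ].

    Summing weight(p) over the points of X is the same as summing
    weight(p) * phi_p(X) over all points p of the retina.
    """
    return sum(weight(p) for p in X) > theta

def more_area_right(X):
    # weight +1 in the right half-plane, -1 in the left, 0 on the axis
    return linear_threshold(lambda p: (p[0] > 0) - (p[0] < 0), X)

def center_of_gravity_right(X):
    # weight x_p: the sum is positive exactly when the mean x is positive
    return linear_threshold(lambda p: p[0], X)

def nth_moment_exceeds(X, n, theta):
    # weight (sqrt(x_p^2 + y_p^2))^n
    return linear_threshold(lambda p: (p[0] ** 2 + p[1] ** 2) ** (n / 2), X, theta)

X = {(1, 0), (2, 1), (-1, 0)}
Y = {(-2, 0), (1, 0)}
```

All three weights are fixed in absolute coordinates, which is why none of these predicates is translation-invariant.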

MATHEMATICAL REMARK:

There is a relation between these observations and Haar’s theorem on the
uniqueness up to a constant factor of invariant measures. For finite
sets and transitive groups the unique Haar measure is, in fact, the
counting-function $\mu(X) = |X|$.

The set function defined by

$$\mu(X) = \sum_i \alpha_i x_i = \sum_{x_i \in X} \alpha_i$$

satisfies $\mu(X) + \mu(Y) = \mu(X \cup Y) + \mu(X \cap Y)$. If we defined
invariance by $\mu(X) = \mu(gX)$ it would follow immediately from Haar’s
theorem that $\mu(X) = c|X|$, where c is a constant. Our hypothesis on $\mu$
is slightly weaker since we merely assume

$$\mu(X) > \theta \iff \mu(gX) > \theta,$$

and deduce a correspondingly weaker conclusion, that is,

$$(\mu(X) > \theta) \iff (c|X| > \theta).$$

In the general case the relation between an invariance theorem and the
theory of Haar measure is less clear since the set function $\sum \alpha_\varphi \varphi(X)$ is
not in general a measure. This seems to suggest some generalization of
measure but we have not tried to pursue this. Readers interested in the
history of ideas might find it interesting to pursue the relation of these
results to those of Pitts and McCulloch [1947].



Parity and One-in-a-box Predicates 

3 


3.0 

In this chapter we study the orders of two particularly interesting 
predicates. Neither of them is “geometric,” because their in- 
variance groups have too little structure. But in §5, we will apply 
them to geometric problems by picking out appropriate “sub- 
groups” which have the right invariance properties. 

3.1 The Parity Function 

In this section we develop in some detail the analysis of the very
simple predicate defined by

$$\psi_{\text{PARITY}}(X) = \lceil |X| \text{ is an odd number} \rceil.$$

Our interest in $\psi_{\text{PARITY}}$ is threefold: it is interesting in itself; it will
be used for the analysis of other more important functions; and,
especially, it illustrates our mathematical methods and the kind of
questions they enable us to discuss.

Theorem 3.1.1: $\psi_{\text{PARITY}}$ is of order $|R|$. That is, to compute it
requires at least one partial predicate whose support is the whole
space R.

proof: Let G be the group of all permutations of R. Clearly
$\psi_{\text{PARITY}}$ is invariant under G. (Because moving points around can’t
change their number!)

Now suppose that $\psi_{\text{PARITY}} = \lceil \sum \alpha_i \varphi_i > 0 \rceil$ where the $\varphi_i$ are the
masks with $|S(\varphi_i)| \le K$. The group invariance theorem tells us
that we can choose the coefficients so that they depend only on the
equivalence classes defined by $\equiv_G$.

But then $\alpha_i$ depends only on $|S(\varphi_i)|$. To see this observe (1) all
masks with the same support are identical, and (2) all sets of the
same size can be transformed into one another by elements of G,

$$\varphi_i \equiv_G \varphi_j \iff |S(\varphi_i)| = |S(\varphi_j)|.$$

Thus $\psi_{\text{PARITY}}$ can be written, using Corollary 2 of §2.3, as

$$\Big\lceil \sum_{j=0}^{K} \alpha_j \sum_{\varphi \in \Phi_j} \varphi > 0 \Big\rceil = \Big\lceil \sum_{j=0}^{K} \alpha_j N_j(X) > 0 \Big\rceil$$

where $\Phi_j$ is the set of masks whose supports contain exactly j
elements. We now calculate, for an arbitrary subset X of R,

$$N_j(X) = \sum_{\varphi \in \Phi_j} \varphi(X).$$

Since $\varphi(X)$ is 1 if $S(\varphi) \subseteq X$ and 0 otherwise, $N_j(X)$ is the number
of subsets of X with j elements, that is,

$$\binom{|X|}{j} = \frac{|X|(|X| - 1) \cdots (|X| - j + 1)}{j!},$$

which is a polynomial of degree j in $|X|$. It follows that

$$\sum_{j=0}^{K} \alpha_j N_j(X)$$

is a polynomial of degree $\le K$ in $|X|$, say $P(|X|)$.

Now consider any sequence of sets $X_0, X_1, \ldots, X_{|R|}$ such that
$|X_i| = i$. Since $P(|X|) > 0$ if and only if $|X|$ is odd,

$$P(|X_0|) \le 0, \quad P(|X_1|) > 0, \quad P(|X_2|) \le 0, \ldots,$$

that is, $P(|X|)$ crosses the X-axis $|R|$ times as $|X|$ increases
from 0 to $|R|$. But P is a polynomial of degree K. It follows (see
Figure 3.1) that $K \ge |R|$. Q.E.D.
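
The counting step in this proof, $N_j(X) = \binom{|X|}{j}$, is easy to confirm by enumeration. A small sketch, assuming a five-point space; the function names `N` and `P` mirror the symbols of the proof but are otherwise illustrative:

```python
from itertools import combinations
from math import comb

R = range(5)   # assumed five-point space

def N(j, X):
    """Number of masks with exactly j support points, all lying inside X."""
    return sum(1 for A in combinations(R, j) if set(A) <= set(X))

def P(alpha, X):
    """The sum over j of alpha[j] * N_j(X) for a list of class coefficients."""
    return sum(a * N(j, X) for j, a in enumerate(alpha))

# N_j(X) depends on X only through |X|, and equals the binomial coefficient.
for k in range(len(R) + 1):
    for X in combinations(R, k):
        for j in range(len(R) + 1):
            assert N(j, X) == comb(len(X), j)
```

Because each $N_j$ depends only on $|X|$, so does any weighted sum `P(alpha, X)`, which is what allows the proof to treat it as a polynomial in the single variable $|X|$.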



Figure 3.1 A polynomial that changes direction K - 1 times
must have degree at least K.





From this we obtain the

Theorem 3.1.2: If $\psi_{\text{PARITY}} \in L(\Phi)$ and if $\Phi$ contains only masks,
then $\Phi$ contains every mask.

proof: Imagine that, even if $\Phi$ contains only masks, and the mask
whose support is A does not belong to $\Phi$, it were possible to write

$$\psi_{\text{PARITY}} = \Big\lceil \sum_{\varphi \in \Phi} \alpha_\varphi \varphi > 0 \Big\rceil.$$

Now define, for any $\psi$, the predicate $\psi^A(X)$ to be $\psi(X \cap A)$.
Then $\psi^A_{\text{PARITY}}$ is the parity function for subsets of A, and has order
$|A|$ by the previous theorem. To study its representation as a
linear combination of masks of subsets of A we consider $\varphi^A$ for
$\varphi \in \Phi$. If $S(\varphi) \subseteq A$, clearly $\varphi^A = \varphi$; otherwise $\varphi^A$ is identically
zero since

$$S(\varphi) \not\subseteq A \;\Rightarrow\; S(\varphi) \not\subseteq X \cap A \;\Rightarrow\; \varphi(X \cap A) = 0 \;\Rightarrow\; \varphi^A(X) = 0.$$

Thus, either $S(\varphi^A)$ is a proper subset of A, or $\varphi^A$ is identically
zero. Now let $\Phi^A$ be the set of masks in $\Phi$ whose supports are
subsets of A. Then

$$\psi^A_{\text{PARITY}} = \Big\lceil \sum_{\varphi \in \Phi^A} \alpha_\varphi \varphi > 0 \Big\rceil.$$

And for all $\varphi \in \Phi^A$, $|S(\varphi)| < |A|$. But this is in contradiction
with Theorem 3.1.1 since it implies that the order of $\psi^A_{\text{PARITY}}$ is less
than $|A|$. Thus the hypothesis is impossible and the theorem
proven.


Corollary 1: If $\psi_{\text{PARITY}} \in L(\Phi)$ then $\Phi$ must contain at least one $\varphi$
for which $|S(\varphi)| = |R|$.

The following theorem, also immediate from the above, is of
interest to students of threshold logic:

Corollary 2: Let $\Phi$ be the set of all $\psi^A_{\text{PARITY}}$ for proper subsets A of
R. Then $\psi_{\text{PARITY}} \notin L(\Phi)$.

The further analysis of $\psi_{\text{PARITY}}$ in Chapter 10 shows that functions
which might be recognizable, in principle, by a very large perceptron,
might not actually be realizable in practice because of impossibly
huge coefficients. For example, it will be shown that in
any representation of $\psi_{\text{PARITY}}$ as linear in the set of masks the ratio
of the largest to the smallest coefficients must be $2^{|R|-1}$.

Group-invariant coefficients for the $|R| = 3$ parity predicate.
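
One explicit group-invariant representation of parity can be written down and checked directly. The coefficients below are not taken from the text: they come from the binomial identity $\sum_{j \ge 1} (-2)^{j-1}\binom{n}{j} = \tfrac12\big(1 - (-1)^n\big)$, and the sketch assumes a five-point space. Each mask with j support points receives the coefficient $(-2)^{j-1}$.

```python
from itertools import combinations

R = range(5)   # assumed five-point space

# Assumed coefficients, one per equivalence class of masks (class = support size j).
alpha = {j: (-2) ** (j - 1) for j in range(1, len(R) + 1)}

def total(X):
    """Sum of the coefficients of all nonempty masks whose support lies in X."""
    points = sorted(X)
    return sum(alpha[j] for j in alpha for _ in combinations(points, j))

def parity(X):
    return int(total(X) > 0)

figures = [frozenset(c) for k in range(len(R) + 1) for c in combinations(R, k)]
assert all(parity(X) == len(X) % 2 for X in figures)

ratio = max(abs(a) for a in alpha.values()) // min(abs(a) for a in alpha.values())
```

For this particular representation the weighted sum is exactly 1 on odd figures and 0 on even ones, and `ratio` equals $2^{|R|-1}$, in line with the growth of coefficients described above.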

3.2* The “One-in-a-box” Theorem 

Another predicate of great interest is associated with the geo- 
metric property of “connectedness.” Its application and interpre- 
tation is deferred to Chapter 5; the basic theorem is proved now. 

Theorem 3.2: Let $A_1, \ldots, A_m$ be disjoint subsets of R and define
the predicate

$$\psi(X) = \lceil |X \cap A_i| > 0, \text{ for every } A_i \rceil,$$

that is, there is at least one point of X in each $A_i$. If for all i,
$|A_i| = 4m^2$, then the order of $\psi$ is $\ge m$.


*This theorem is used to prove the theorem in §5.1. Because §5.7 gives an in- 
dependent proof (using Theorem 3.1.1), this section can be skipped on first 
reading. 








Corollary: If $R = A_1 \cup A_2 \cup \cdots \cup A_m$, the order of $\psi$ is at least
$(\tfrac14 |R|)^{1/3}$.

proof: For each $i = 1, \ldots, m$ let $G_i$ be the group of permutations
of R which permute the elements of $A_i$ but do not affect the
elements of the complement of $A_i$.

Let G be the group generated by all elements of the $G_i$.

Clearly $\psi$ is invariant with respect to G.

Let $\Phi$ be the set of masks of degree k or less. To determine the
equivalence class of any $\varphi \in \Phi$ consider the “occupancy numbers”:

$$|S(\varphi) \cap A_i|.$$

Note that $\varphi_1 \equiv_G \varphi_2$ if and only if $|S(\varphi_1) \cap A_i| = |S(\varphi_2) \cap A_i|$
for each i. Let $\Phi_1, \Phi_2, \ldots$ be the equivalence classes.

Now consider an arbitrary set X and an equivalence class $\Phi_j$. We
wish to calculate the number $N_j(X)$ of members of $\Phi_j$ satisfied by
X, that is,

$$N_j(X) = |\{\varphi \mid \varphi \in \Phi_j \text{ AND } S(\varphi) \subseteq X\}|.$$

A simple combinatorial argument shows that

$$N_j(X) = \binom{|X \cap A_1|}{|S(\varphi) \cap A_1|} \binom{|X \cap A_2|}{|S(\varphi) \cap A_2|} \cdots \binom{|X \cap A_m|}{|S(\varphi) \cap A_m|}$$

where

$$\binom{y}{n} = \frac{y(y - 1) \cdots (y - n + 1)}{n!}$$

and $\varphi$ is an arbitrary member of $\Phi_j$. Since the numbers $|S(\varphi) \cap A_i|$
depend only on the classes $\Phi_j$ and add up to not more than
k, it follows that $N_j(X)$ can be written as a polynomial of degree
k or less in the numbers $x_i = |X \cap A_i|$:

$$N_j(X) = P_j(x_1, \ldots, x_m).$$





Now let $\lceil \sum \alpha_\varphi \varphi > 0 \rceil$ be a representation of $\psi$ as a linear threshold
function in the set of masks of degree less than or equal to k. By
the argument which we have already used several times we can
assume that $\alpha_\varphi$ depends only on the equivalence class of $\varphi$ and write

$$\sum_\varphi \alpha_\varphi \varphi(X) = \sum_j \beta_j \sum_{\varphi \in \Phi_j} \varphi(X) = \sum_j \beta_j N_j(X) = \sum_j \beta_j P_j(x_1, \ldots, x_m),$$

which, as a sum of polynomials of degree at most k, is itself such a
polynomial. Thus we can conclude that there exists a polynomial
of degree at most k, $Q(x_1, \ldots, x_m)$, with the property that

$$\psi(X) = \lceil Q(x_1, \ldots, x_m) > 0 \rceil \qquad (x_i =_{df} |X \cap A_i|),$$

that is, that if all $x_i$ lie in the range

$$0 \le x_i \le 4m^2,$$

then

$$Q(x_1, \ldots, x_m) > 0 \iff x_i > 0 \text{ for all } i.$$

In $Q(x_1, \ldots, x_m)$ make the formal substitution

$$x_i = [t - (2i - 1)]^2.$$

Then $Q(x_1, \ldots, x_m)$ becomes a polynomial of degree at most 2k
in t. Now let t take on the values $t = 0, 1, \ldots, 2m$. Then

$x_i = 0$ for some i, if t is odd; in fact, for $i = \tfrac12(t + 1)$;

but

$x_i > 0$ for all i, if t is even.

Hence, by the definition of the $\psi$ predicate, Q must be positive for
even t and negative or zero for odd t. By counting the number of
changes of sign it is clear that $2k \ge 2m$, that is, $k \ge m$. This
completes the proof.
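
The effect of the substitution $x_i = [t - (2i - 1)]^2$ can be tabulated for a small case. The sketch below assumes m = 3 and uses invented function names; it confirms that the occupancy numbers stay within the allowed range and that the predicate alternates as t runs from 0 to 2m.

```python
m = 3   # assumed number of boxes

def occupancy(t):
    """The numbers x_i = (t - (2i - 1))**2 for i = 1, ..., m."""
    return [(t - (2 * i - 1)) ** 2 for i in range(1, m + 1)]

def one_in_a_box(xs):
    """True when every box contains at least one point."""
    return all(x > 0 for x in xs)

ts = range(2 * m + 1)
pattern = [one_in_a_box(occupancy(t)) for t in ts]

# Every x_i is a legitimate occupancy number: an integer between 0 and 4m^2.
largest = max(max(occupancy(t)) for t in ts)

# Count how often the predicate changes value between consecutive t.
sign_changes = sum(1 for a, b in zip(pattern, pattern[1:]) if a != b)
```

The predicate flips at every step, giving 2m changes, so a polynomial in t that reproduces this pattern must have degree at least 2m.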



The “And/Or” Theorem 

4 


4.0 

In this chapter we prove the “And/Or” theorem stated in §1.5. 

Theorem 4.0: There exist predicates $\psi_1$ and $\psi_2$ of order 1 such that
$\psi_1 \wedge \psi_2$ and $\psi_1 \vee \psi_2$ are not of finite order.

We prove the assertion for $\psi_1 \wedge \psi_2$. The other half can be proved
in exactly the same way. The techniques used in proving this 
theorem will not be used in the sequel and so the rest of the 
chapter can be omitted by readers who don’t know, or who dis- 
like, the following kind of algebra. 

4.1 Lemmas 

We have already remarked in §1.5 that if $R = A \cup B \cup C$ the
predicate $\lceil |X \cap A| > |X \cap C| \rceil$ is of order 1, and stated without
proof that if A, B, and C are disjoint (see Figure 4.1), then

$$\lceil (|X \cap A| > |X \cap C|) \wedge (|X \cap B| > |X \cap C|) \rceil$$

is not of bounded order as $|R|$ becomes large. We shall now
prove this assertion. We can assume without any loss of generality
that the three parts of R have the same size $M = |A| = |B| = |C|$,
and that $|R| = 3M$. We will consider predicates of the stated
form for different-size retinas. We will prove that

If $\psi_M(X)$ is the predicate of the stated form for $|R| = 3M$, then
the order of $\psi_M$ increases without bound as $M \to \infty$.

The proof follows the pattern of proofs in Chapter 3. We shall
assume that the order of $\{\psi_M\}$ is bounded by a fixed integer N
for all M, and derive a contradiction by showing that the
associated polynomials would have to satisfy inconsistent conditions.
The first step is to set up the associated polynomials for a fixed
M. We do this by choosing the group of permutations that leaves
the sets A, B, and C fixed but allows arbitrary permutations
within the sets. The equivalence class of a mask $\varphi$ is then
characterized by the three numbers $|A \cap S(\varphi)|$, $|B \cap S(\varphi)|$, and
$|C \cap S(\varphi)|$. For any given mask $\varphi$ and set X the number of
masks equivalent to $\varphi$ and satisfied by X is

$$N_\varphi(X) = \binom{|A \cap X|}{|A \cap S(\varphi)|} \times \binom{|B \cap X|}{|B \cap S(\varphi)|} \times \binom{|C \cap X|}{|C \cap S(\varphi)|}.$$

Since we are assuming $|S(\varphi)| \le N$, we can be sure that $N_\varphi(X)$
is a polynomial of degree at most N in the three numbers

$$x = |A \cap X|, \quad y = |B \cap X|, \quad z = |C \cap X|.$$

Let $\Phi$ be the set of masks with $|\text{support}| \le N$. Enumerate the
equivalence classes of $\Phi$ and let $N_i(X)$ be the number of masks
of the ith class satisfied by X. The group invariance theorem
allows us to write

$$\psi_M(X) = \Big\lceil \sum_i \beta_i N_i(X) > 0 \Big\rceil.$$

The sum $\sum_i \beta_i N_i(X)$ is a polynomial of degree at most N in x, y, z.
Call it $P_M(x, y, z)$.

Now, by definition, for those values of x, y, z which are possible
occupancy numbers, that is, nonnegative integers $\le M$,

$P_M(x, y, z) > 0$ if and only if $x > z$ and $y > z$.

We shall show, through a series of lemmas, that this cannot be
true for all M.
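
The product-of-binomials count used in this setup can be verified by enumeration. The sketch below assumes M = 3, with A, B, and C taken to be consecutive blocks of integers; the function names are illustrative only.

```python
from itertools import combinations, product
from math import comb

M = 3   # assumed size of each part
A, B, C = set(range(0, 3)), set(range(3, 6)), set(range(6, 9))
R = A | B | C

def signature(S):
    """Occupancy numbers (|A n S|, |B n S|, |C n S|) of a set S."""
    return (len(S & A), len(S & B), len(S & C))

def count_satisfied(sig, X):
    """Brute force: masks with occupancy signature sig whose support lies in X."""
    k = sum(sig)
    return sum(1 for S in combinations(sorted(R), k)
               if signature(set(S)) == sig and set(S) <= X)

def product_formula(sig, X):
    """The same count as a product of three binomial coefficients."""
    x, y, z = signature(X)
    a, b, c = sig
    return comb(x, a) * comb(y, b) * comb(z, c)

def psi_M(X):
    """The And predicate: more points in A than in C, and more in B than in C."""
    x, y, z = signature(X)
    return x > z and y > z

X = {0, 1, 3, 6, 7}
signatures = [s for s in product(range(M + 1), repeat=3) if sum(s) <= 3]
assert all(count_satisfied(s, X) == product_formula(s, X) for s in signatures)
```

Since the count depends on X only through (x, y, z), any group-invariant weighted sum of such counts is a polynomial in those three numbers, which is the polynomial $P_M$ of the argument.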

Lemma 1: Let $P_1(x, y, z), P_2(x, y, z), \ldots$ be an infinite sequence of
nonzero polynomials of degree at most N, with the property that
for all positive integers x, y, z less than M

SEPARATION CONDITIONS

$x > z$ and $y > z$ implies $P_M(x, y, z) \ge 0$,

$x \le z$ or $y \le z$ implies $P_M(x, y, z) \le 0$.

Then there exists a single nonzero polynomial $P(x, y, z)$ of degree at
most N with the property that the separation conditions, with P in
the place of $P_M$, hold for ALL positive integral values of x, y, z. It
should be observed that we have had to weaken the separation
conditions by allowing equality in both conditions since inequality
would not be preserved in the limit. Consequences of this will
make themselves felt in the proof of Lemma 2.

proof: Write

$$P_M(x, y, z) = \sum_{i=1}^{T} c_{M,i}\, m_i(x, y, z),$$

where $m_1, m_2, \ldots, m_T$ is an enumeration of the monomials of
degree $\le N$ in x, y, z.

Since the conditions on $P_M$ are preserved under multiplication by
a positive scaling factor, we can assume that

$$\sum_i c_{M,i}^2 = 1.$$

Now consider the set of points in T-dimensional space

$$C_M = (c_{M,1}, c_{M,2}, \ldots, c_{M,T}), \qquad M = 1, 2, \ldots.$$

These all lie in a compact* set, the surface of the unit T-dimensional
sphere. There is, therefore, a subsequence $C_{M_j}$ which
converges to a limit

$$C_{M_j} \to C = (c_1, c_2, \ldots, c_T)$$

in the sense that, for each i,

$$\lim_{j \to \infty} c_{M_j, i} = c_i.$$

The polynomial

$$P(x, y, z) = \sum_{i=1}^{T} c_i m_i(x, y, z)$$

inherits the separation conditions for all positive integral values
of x, y, z. That it is not identically zero follows from the fact
that the $c_i$ inherit the condition $\sum c_i^2 = 1$.

*See index.

In order to prove our main theorem, we first establish a
corresponding result for polynomials in two variables, and later
(Lemma 3) adapt it to $P(x, y, z)$.

Lemma 2: If a polynomial $f(\alpha, \beta)$ satisfies the following conditions
for all integral values of $\alpha$ and $\beta$, then it is identically zero:

$\alpha > 0$ and $\beta > 0$ implies $f(\alpha, \beta) \ge 0$,

$\alpha \le 0$ or $\beta \le 0$ implies $f(\alpha, \beta) \le 0$.

proof: Suppose that $f(\alpha, \beta)$ could satisfy these conditions yet
not be identically zero. Then we could write it in the form

$$f(\alpha, \beta) = \beta^N g(\alpha) + r(\alpha, \beta)$$

with $g(\alpha)$ not identically zero and with $r(\alpha, \beta)$ of degree lower
than N in $\beta$. We can then find a number $\alpha_0 > 0$ such that neither
of $g(\pm\alpha_0)$ is zero, and then we can choose a number $\beta_0$ so large
that all four of the inequalities

$$|\beta_0^N g(\pm\alpha_0)| > |r(\pm\alpha_0, \pm\beta_0)|$$

are satisfied so that $r(\pm\alpha_0, \pm\beta_0)$ cannot affect the sign of
$f(\pm\alpha_0, \pm\beta_0)$. Comparing $f(\alpha_0, \beta_0) \ge 0$ with $f(\alpha_0, -\beta_0) \le 0$ shows
that $g(\alpha_0) > 0$ and that N must be odd. Then, since

$$f(-\alpha_0, \beta_0) \le 0$$

we have

$$g(-\alpha_0) < 0;$$

hence

$$(-\beta_0)^N g(-\alpha_0) > 0;$$

hence

$$f(-\alpha_0, -\beta_0) > 0;$$

which contradicts the conditions, and hence proves the lemma.





4.2 A Digression on Bezout’s Theorem 

Readers familiar with elementary algebraic geometry will observe
that the lemma would follow immediately from Bezout’s theorem
if the conditions could be stated for all real values of $\alpha$ and $\beta$. We
would then merely have to prove that the doubly infinite L of
Figure 4.2 is not an algebraic curve.

Figure 4.2 (regions labeled $f(\alpha, \beta) \ge 0$ and $f(\alpha, \beta) \le 0$; axes $\pm\alpha$, $\pm\beta$)

Bezout’s theorem tells us that if the intersection of an algebraic
curve L with an irreducible algebraic curve Y contains an infinite
number of points, it must contain the whole of Y. But the L
contains the positive half of the y axis. Straight lines are irreducible,
so L would have to contain the entire y axis if it were algebraic.

Unfortunately, because our conditions hold only on integer
lattice-points, we must allow for the possibility that $f(\alpha, \beta) = 0$
takes a more contorted form as, for example, in Figure 4.3. Part
of the pathological behavior of this curve is irrelevant. Since a
polynomial of degree N can cut a straight line only N times, the
incursions into the interiors of the quadrants can be confined to a
bounded region. This means that the curve $f(\alpha, \beta) = 0$ must
“asymptotically occupy” the parts of the “channel” illustrated
in Figure 4.4.








It seems plausible that a generalization of Bezout’s theorem could 
be formulated to deduce from this that the curve must enter the 
negative halves in a sense that would furnish an immediate and 
more illuminating proof of our lemma. We have not pursued this 
conjecture. 

Lemma 3: If a polynomial $P(x, y, z)$ satisfies the following conditions
for all positive integral values of x, y, and z, then it is
identically zero:

$x > z$ and $y > z$ implies $P(x, y, z) \ge 0$,
$x \le z$ or $y \le z$ implies $P(x, y, z) \le 0$.

proof: Suppose that $P(x, y, z)$ had these properties, but were not
identically zero. Define $Q(\alpha, \beta, z) = P(z + \alpha, z + \beta, z)$ and write

$$Q(\alpha, \beta, z) = z^M f(\alpha, \beta) + r(\alpha, \beta, z),$$

where r is of degree less than M in z, and $f(\alpha, \beta)$ is not identically
zero. Then we can show that f must satisfy the conditions in
Lemma 2: Choose any $\alpha_0$ and $\beta_0$ for which $f(\alpha_0, \beta_0) \neq 0$. Choose
a $z_0$ so large that

$z_0 + \alpha_0 > 0$, $z_0 + \beta_0 > 0$, and $|z_0^M f(\alpha_0, \beta_0)| > |r(\alpha_0, \beta_0, z_0)|$.

It follows that $f(\alpha_0, \beta_0) > 0$ if and only if $Q(\alpha_0, \beta_0, z_0) > 0$, that is,
if and only if $P(z_0 + \alpha_0, z_0 + \beta_0, z_0) > 0$. Thus

$\alpha_0 > 0$ and $\beta_0 > 0 \;\Rightarrow\; z_0 + \alpha_0 > z_0$ and $z_0 + \beta_0 > z_0$
$\;\Rightarrow\; P(z_0 + \alpha_0, z_0 + \beta_0, z_0) \ge 0$
$\;\Rightarrow\; f(\alpha_0, \beta_0) > 0$, and similarly,

$\alpha_0 \le 0$ OR $\beta_0 \le 0 \;\Rightarrow\; f(\alpha_0, \beta_0) < 0$.

But this is true for all $\alpha_0$, $\beta_0$. Thus by Lemma 2, $f(\alpha, \beta) \equiv 0$.
It follows that $Q(\alpha, \beta, z)$ is of degree zero in z, which is only
possible if P is identically zero.

This concludes the proof of the And/Or Theorem.



GEOMETRIC THEORY 
OF LINEAR INEQUALITIES 




[70] Geometric Theory of Linear Inequalities 


Introduction to Part II 

The analysis of geometric properties of perceptrons begins, in
Chapter 5, with the study of the predicate ψ_CONNECTED: Is the
figure X all connected together, in the sense that between any two
points of the figure there is a continuous path that lies entirely
within the figure (see §0.5)? We chose to investigate connectedness
because of a belief that this predicate is nonlocal in some very deep
sense; therefore it should present a serious challenge to any
basically local, parallel type of computation. Originally, we tried
to prove that ψ_CONNECTED is not of finite order by exploiting its
sensitivity to small changes in X (any connected figure is easily
converted to a disconnected figure by making a thin cut or by
adding an isolated point), but we were unable to convert this to a
real proof.

The successful methods were based on using the group-invariance
theorem, but indirectly. We recall that in dealing with ψ_PARITY we
began by identifying the largest possible group of transformations
of R that leaves ψ invariant; in the case of ψ_PARITY, the group of all
permutations. We then used this group to coalesce the φ's into
equivalence classes, and eventually reduced the problem about
representing ψ in L(Φ) to a problem about polynomials in
enumeration functions.

But in the case of ψ_CONNECTED, we find that any attempt to apply
this technique directly leads to severe problems associated with
the representation of a general topological transformation on a
discrete retina. Fortunately, it was possible to “reduce” the
problem to a simple one involving more tractable groups. In fact, we
see in §5.1 that if a perceptron could discriminate between just
certain restricted instances of connectedness, then it could be
made to simulate the “one-in-a-box” predicate of §3.2. If this
were possible, we would have, logically:

$$\psi_{\mathrm{CONNECTED}} \text{ is finite order} \;\Rightarrow\; \psi_{\mathrm{CONNECTED}}\,|\,\text{restricted is finite order} \;\Rightarrow\; \psi_{\mathrm{ONE\text{-}IN\text{-}A\text{-}BOX}} \text{ is finite order},$$

and since the last is false, so is the first.

Toward the end of Chapter 5, this firmly negative result (that
ψ_CONNECTED is not of finite order) is generalized to show that the
same is true of all topological predicates, with one single type of
exception. Only the Euler number, the lowest and simplest of all
the topological invariants, can be recognized by the finite-order
predicate-scheme.

[Handwritten note by the authors, not legible in this copy.]

In Chapter 6 we obtain a series of positive results. There are a 
variety of geometric properties, in addition to ψ_CONVEX and ψ_CIRCLE
mentioned in §0.5, that are quite clearly of finite (and in fact of 
rather low) order. These include particular forms like triangles or 
squares or letters of the alphabet. From some of these a type of 
description emerges that we call “geometric spectra.” These can 
be regarded either as local geometric properties or as simple sta- 
tistical qualities of the patterns. The fact that perceptrons can 
recognize certain patterns related to these spectra is probably 
responsible for some of the false optimism about the capabilities 
of perceptrons in general. At the end of Chapter 6 we see that 
while these patterns can be identified in isolation, the perceptron 
cannot detect them in more complicated contexts. 


Chapter 7 is a curious detour. It turns out that certain predicates
that do not seem at first to have finite order, such as symmetries
or similarities between pairs of figures, can in fact be realized by
finite-order predicate-schemes. But the realizations have a
peculiar unreality, for their coefficients grow at such astronomical
rates as to be physically meaningless. The incident seems to have
an important moral: even within a simple combinatorial subject
such as this, one must be on guard for nonobvious codes or
representations of things. The linear forms obtained by the
“stratification” method of Chapter 7 have a quality somewhat like the
Gödel numbers of logic, or the “nonstandard models” of
mathematical analysis. Our intuition is still weak in the field of
computation, and there are surely many more surprises to come.


We study the diameter-limited perceptron in Chapter 8. Here the 
situation is much simpler, and one does not even need the alge- 
braic theory to obtain generally negative results. For the most 
part, it turns out that the diameter-limited machines are subject 
to limitations similar to those of the order-1 machines. In certain
respects they seem different: for example, in their ability to
approximate certain integral-like computations. This makes it
possible for them to recognize ψ_CIRCLE within some accuracy
limitations. They can also compute a narrowly limited class of
predicates related to the Euler number.


The predicate ψ_CONNECTED seemed so important in this study that we
felt it appropriate to try to relate the perceptron’s performance
to that of some other, fundamentally different, computation
schemes. In Chapter 9 we study it in the context of a wide variety
of models for geometric computation. We were surprised to find
that, for serial computers, only a very small amount of memory
was required. One might have supposed that something like a
“push-down list” would be needed so that the machine could
retrace its steps in the course of exploring the maze of possible
paths through a figure.


Representing Geometrical Patterns 

We are about to study a number of interesting geometrical predi- 
cates. But as a first step, we have to provide the underlying space 
R with the topological and metric properties necessary for de- 
fining geometrical figures; this was not necessary in the case of 
predicates like parity and others related to counting, for these 
were not really geometric in character. 

The simplest procedure that is rigorous enough yet not too
mathematically fussy seems to be to divide the Euclidean plane,
E^2, into squares, as an infinite chess board. The set R is then
taken as the set of squares. A figure X_E of E^2 is then identified
with the squares that contain at least one point of X_E. Thus to any
subset X_E of E^2 corresponds the subset X of R defined by

x ∈ X if at least one point of X_E lies in the square x.
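
The map from a plane figure to its set of squares can be illustrated with a small sketch. It is only an illustration under stated assumptions: the figure X_E is taken to be a disk, the squares are unit squares treated as closed and indexed by their lower-left integer corner, and the function name `squares_hit_by_disk` is hypothetical.

```python
import math

def squares_hit_by_disk(cx, cy, radius):
    """Return the unit squares [i, i+1] x [j, j+1] that contain
    at least one point of the disk centred at (cx, cy)."""
    hit = set()
    lo_i, hi_i = math.floor(cx - radius) - 1, math.floor(cx + radius) + 1
    lo_j, hi_j = math.floor(cy - radius) - 1, math.floor(cy + radius) + 1
    for i in range(lo_i, hi_i + 1):
        for j in range(lo_j, hi_j + 1):
            # Nearest point of the closed square to the disk's centre.
            nx = min(max(cx, i), i + 1)
            ny = min(max(cy, j), j + 1)
            if (nx - cx) ** 2 + (ny - cy) ** 2 <= radius ** 2:
                hit.add((i, j))
    return hit
```

A disk much smaller than one square maps to a single square, which is one way to see why small circles stop looking round near the limit of resolution.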


Now, although X and X_E are logically distinct, no serious confusion can
arise if we identify them, and we shall do so from now on. Thus we refer
to certain subsets of R as “circles,” “triangles,” etc., meaning that they
can be obtained from real circles and triangles by the map X_E → X. Of
course, this means that near the “limits of resolution” one begins to
obtain apparent errors of classification because of the finite “mesh”
of R. Thus a small circle will not look very round.

If it were necessary to distinguish between E^2 and R we would say that
two figures X_E, X'_E of E^2 are in the same R-tolerance class if X = X'.
There is no problem with the translation groups that play the main roles
in Chapters 6, 7, and 8. There is a serious problem of handling the
tolerances when discussing, as in §7.6, dilations or rotations. Curiously,
the problem does not seem to arise in discussing general topological
equivalence, in Chapter 5, because we can prove all the theorems we know
by using less than the full group of topological transformations.



ψ_CONNECTED: A Geometric Property
with Unbounded Order

5 


5.0 Introduction 

In this chapter we begin the study of connectedness. A figure X 
is connected if it is not composed of two or more separate, non- 
touching, parts. While it is interesting in itself, we chose to study 
the connectedness property especially because we hoped it would 
shed light on the more basic, though ill-defined, question of 
local vs. global property. For connectedness is surely global. 
One can never conclude that a figure is connected from isolated 
local experiments. To be sure, in the case of a figure like 



one would discover, by looking locally at the neighborhood of the 
isolated point in the lower right corner, that the figure is not 
connected. But one could not conclude that a figure is connected, 
from the absence of any such local evidence of disconnectivity. 
If we ask which one of these two figures is connected 



Figure 5.1 


it is difficult to imagine any local event that could bias a decision
toward one conclusion or the other. Now, this is easy to prove,
for example, in the narrow framework of the diameter-limited
concept of local (see §0.3 and Chapter 8). It is harder to establish
for the order-limited framework. But the diameter-limited case
gives us a hint: by considering a particular subclass of figures
we might be able to show that the problem is equivalent to that
of recognizing a parity, or something like it, and this is what we
in fact shall do.


5.1* The Connectedness Theorem 

Two points of R are adjacent if they are squares with a common
edge.† A figure is connected if, given any two points (that is,
“squares”) p_1, p_2 of the figure, we can find a path through
adjacent points from p_1 to p_2.
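
This definition translates directly into a search over edge-adjacent squares. The sketch below is illustrative only: a figure is represented as a set of (row, column) pairs, the name `is_connected` is hypothetical, and treating the empty figure as connected is a convention assumed here, not taken from the text.

```python
def is_connected(figure):
    """True if every square can be reached from every other
    through squares that share an edge (corner contact does not count)."""
    figure = set(figure)
    if not figure:
        return True  # convention assumed for the empty figure
    start = next(iter(figure))
    seen, frontier = {start}, [start]
    while frontier:
        r, c = frontier.pop()
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if nxt in figure and nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen == figure
```

Two squares that touch only at a corner, such as (0, 0) and (1, 1), are reported as disconnected, in line with the footnote on corner contact.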

Theorem 5.1: The predicate ψ_CONNECTED(X) = ⌈X is connected⌉ is
not of finite order (§1.6), that is, it has arbitrarily large orders
as |R| grows in size.

proof: Suppose that ψ_CONNECTED(X) could have order < m. Consider
an array of squares of R arranged in 2m + 1 rows of 4m^2
squares each (Figure 5.2). Let Y_0 be the set of points shaded in the
diagram, that is, the array of points in odd-numbered rows, and
let Y_1 be the remaining squares of the array. Let F be the family
of figures obtained from the figure Y_0 by adding subsets of Y_1;



Figure 5.2



*We will give two other proofs from different points of view. The proof in §5.5 
is probably the easiest to understand by itself, but the proof in §5.7 gives more 
information about the way the order grows with the size of R. 

†We can’t allow corner contact (two squares touching only at a vertex) to
be considered as connection. For this would allow two “curves” to cross
without “intersecting” and not even the Jordan curve theorem would be
true. The problem could be avoided by dividing E^2 into hexagons instead
of squares!




that is, X ∈ F if it is of the form Y_0 ∪ X_1 where X_1 ⊂ Y_1. Now X
will be connected if and only if X_1 contains at least one square
from each even row; that is, if the set X_1 satisfies the
“one-in-a-box” condition of §3.2.
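
The equivalence between connectedness on the family F and the one-in-a-box condition can be checked exhaustively on small arrays. The sketch below is only an illustration: it uses a much narrower array than the 4m^2 columns of the proof so that all subsets can be enumerated, and the function names are hypothetical.

```python
from itertools import product

def connected(figure):
    """Edge-adjacency connectedness of a non-empty set of (row, col) squares."""
    start = next(iter(figure))
    seen, frontier = {start}, [start]
    while frontier:
        r, c = frontier.pop()
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if nxt in figure and nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen == figure

def family_member(m, width, chosen):
    """Y_0 (all of the odd rows 1, 3, ..., 2m+1) plus a subset X_1 of the even rows."""
    y0 = {(row, col) for row in range(1, 2 * m + 2, 2) for col in range(width)}
    return y0 | set(chosen)

def one_in_a_box(m, chosen):
    """True if the chosen squares include at least one from every even row."""
    rows = {row for row, _ in chosen}
    return all(row in rows for row in range(2, 2 * m + 1, 2))

def equivalence_holds(m, width):
    """Compare the two predicates on every subset X_1 of the even rows."""
    even = [(row, col) for row in range(2, 2 * m + 1, 2) for col in range(width)]
    for bits in product([0, 1], repeat=len(even)):
        chosen = [sq for sq, b in zip(even, bits) if b]
        if connected(family_member(m, width, chosen)) != one_in_a_box(m, chosen):
            return False
    return True
```

The width does not affect the equivalence itself; it matters only for the order bound, which comes from the one-in-a-box theorem.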

To see the details of how the one-in-a-box theorem is applied, if it
is not already clear, consider the figures of family F as a subset
of all possible figures on R. Clearly, if we had an order-k predicate
ψ_CONNECTED that could recognize connectivity on R, we could have
one that works on F; namely the same predicate with constant
zero values to all variables not in Y_0 ∪ Y_1. And since all points
of the odd rows have always value 1 for figures in F, this in turn
means that we could have an order-k predicate to decide the
one-in-a-box property on set Y_1; namely the same predicate further
restricted to having constant unity values to the points in Y_0. Thus
each Boolean function of the original predicate ψ_CONNECTED is
replaced by the function obtained by fixing certain of its variables
to zero and to one; this operation can never increase the order of
a function. But since this last predicate cannot exist, neither can
the original ψ_CONNECTED. This proof shows that ψ_CONNECTED has order
at least C|R|^(1/3). In §5.7 we show it is at least C|R|^(1/2).
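
The exponent 1/3 can be traced to the size of the array used in the proof. The following short calculation assumes, for illustration, that the array fills the whole retina; the constant 12 is just one convenient choice and does not come from the text.

```latex
|R| = (2m+1)\cdot 4m^{2} \;\le\; 3m \cdot 4m^{2} = 12\,m^{3} \quad (m \ge 1)
\qquad\Longrightarrow\qquad
\operatorname{order}(\psi_{\mathrm{CONNECTED}}) \;\ge\; m \;\ge\; \left(\frac{|R|}{12}\right)^{1/3}.
```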

5.2 An Example 

Consider the special case for k = 2, and the equivalent
one-in-a-box problem for a space of the form



in which m = 3 and there are just 4 squares in each box. Now
consider a ψ of degree 2; we will show that it cannot characterize
the connectedness of pictures of this kind. Suppose that
ψ = ⌈Σ α_φ φ > 0⌉ and consider the equivalent form, symmetrized under
the full group of permutations that interchange the rows and
permute within rows.* Then there are just three equivalence-
classes of masks of degree ≤ 2, namely:

Single points: φ^1_i = x_i,

Point-pairs: φ^11_ij = x_i x_j (x_i and x_j in same row),

Point-pairs: φ^12_ij = x_i x_j (x_i and x_j in different rows).

Hence any order-2 predicate must have the form

$$\psi = \lceil \alpha_1 N_1(X) + \alpha_{11} N_{11}(X) + \alpha_{12} N_{12}(X) > 0 \rceil$$

where N_1, N_11, and N_12 are the numbers of point sets of the
respective types in the figure X. Now consider the two figures:


[Figure: two configurations, with ψ_CONNECTED(X_1) = 1 and ψ_CONNECTED(X_2) = 0]


In each case one counts

N_1 = 6, N_11 = 6, N_12 = 9;

hence ψ has the same value for both figures. But X_1 is connected
while X_2 is not! Note that here m = 3 so that we obtain a
contradiction with |A_i| = 4, while the general proof required |A_i| =
4m^2 = 36. The same is true with |A_i| = 3, m = 4, because (3, 1,
1, 1) ∼ (2, 2, 2, 0). It is also known that if m = 6, we can get a
similar result with |A_i| = 16. This was shown by Dona Strauss.
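
The counts above depend only on how many squares of the figure fall in each box, so they are easy to recompute. In the sketch below the row occupancies (4, 1, 1) and (3, 3, 0) are an assumption: the two figures are not reproduced here, and these are simply one pair of occupancies consistent with the stated counts. The function name `spectrum` is hypothetical.

```python
from math import comb

def spectrum(row_counts):
    """Return (N_1, N_11, N_12) for a figure with the given number of
    squares in each row: points, same-row pairs, different-row pairs."""
    n1 = sum(row_counts)
    n11 = sum(comb(k, 2) for k in row_counts)
    n12 = comb(n1, 2) - n11  # every remaining pair straddles two rows
    return n1, n11, n12

# One square in every row (passes one-in-a-box) versus an empty row (fails).
print(spectrum((4, 1, 1)), spectrum((3, 3, 0)))
# The m = 4, |A_i| = 3 pair mentioned in the text.
print(spectrum((3, 1, 1, 1)), spectrum((2, 2, 2, 0)))
```

Since any symmetrized order-2 predicate sees only these three numbers, it must give the same answer on both members of each pair.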

The case of m = 3, |A_i| = 3 is of order 2, since one can write

$$\psi_{\mathrm{CONNECTED}} = \lceil 3 N_1(X) - 2 N_{11}(X) > 8 \rceil.$$
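
This order-2 form can be checked by brute force, since it depends only on the number of occupied squares in each of the three rows. The sketch assumes, as in §5.1, that connectedness on this family coincides with having at least one square in every row; the function names are hypothetical.

```python
from itertools import product
from math import comb

def psi(row_counts):
    """The order-2 threshold form 3*N_1 - 2*N_11 > 8."""
    n1 = sum(row_counts)
    n11 = sum(comb(k, 2) for k in row_counts)
    return 3 * n1 - 2 * n11 > 8

def one_in_each_row(row_counts):
    return all(k >= 1 for k in row_counts)

# Every occupancy pattern for 3 rows of 3 squares each.
agrees = all(psi(c) == one_in_each_row(c) for c in product(range(4), repeat=3))
```

A row holding k squares contributes 3k - k(k - 1) = 0, 3, 4, 3 for k = 0, 1, 2, 3, so the sum exceeds 8 exactly when no row is empty.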


*Note that this is not the same group used in proving Theorem 3.2. There we
did not use the row-interchange part of the group.




The proof method used in these examples is an instance of the use
of what we call the “geometric n-tuple spectrum,” and the general
principle is further developed in Chapter 6.


5.3 Slice-wise Connectivity 

It should be observed that the proof in §5.1 applies not only to the 
property of connectivity in its classical sense but to the stronger predicate 
defined by: 

⌈There is a straight line L such that X does not intersect L and does not lie
entirely to one side of L⌉.

The general definition of connectedness would have “curve” for L in- 
stead of “straight line,” and one would expect that this would require a 
higher order for its realization. 


5.4 Reduction of One Perceptron to Another 

In proving that ψ_CONNECTED is not of finite order, our approach was
first to prove this about a different (and simpler) predicate
ψ_ONE-IN-A-BOX. Then we showed that ψ_CONNECTED could be used, on a
certain subset of figures, to compute ψ_ONE-IN-A-BOX: therefore its
order must be at least as high. There are, of course, many other
figures that ψ_CONNECTED will have to classify (in addition to those
that contain all points of Y_0 in §5.1), but it was sufficient to study
the predicate’s behavior just on this subclass of figures.

We will use this idea again, many times, but the situation will be 
slightly more complicated. In the case just discussed, both predi- 
cates were defined on figures in the same retina, but in the sequel, 
we will often want to establish a relation between two predicates 
defined on different spaces. The flexibility to do this is established 
by the following simple theorem. 


5.4.1 The Collapsing Theorem 

This theorem will enable us to deduce limits on the order of a
predicate ψ̃ on a set R̃ from information about the order of a
related predicate ψ on a set R.

[Handwritten note and sketch by the authors, not legible in this copy.]


Let F be a function that associates with any figure X in R a
figure X̃ = F(X) in R̃. Now let ψ̃ be any predicate on R̃. This
predicate defines a predicate ψ on R by the computation

$$\psi(X) = \tilde{\psi}(F(X)) = \tilde{\psi}(\tilde{X}).$$

Theorem 5.4.1: The order of ψ̃ is ≥ the order of ψ, provided that
each point of R̃ depends upon at most one point of R, in the sense
that for each point x̃ of R̃, either it has a constant value

x̃ ∈ X̃ for all X, or
x̃ ∉ X̃ for all X,

or else there is a point x of R such that either

⌈x̃ ∈ X̃⌉ = ⌈x ∈ X⌉ for all X, or
⌈x̃ ∈ X̃⌉ = ⌈x ∉ X⌉ for all X.

proof: Suppose that ψ̃ has a realization of order K:

$$\tilde{\psi} = \lceil \textstyle\sum \alpha_i \tilde{\varphi}_i > 0 \rceil.$$

Then ψ has a realization

$$\psi = \lceil \textstyle\sum \alpha_i \varphi_i > 0 \rceil,$$

where φ_i(X) is φ̃_i(F(X)). To see that |S(φ_i)| ≤ K, recall that
φ̃_i depends on at most K points of R̃, and these in turn depend on
at most K points of R. So φ_i(X) = φ̃_i(F(X)) depends on at most
K points of R.
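
The bookkeeping in this proof can be sketched in a few lines. The representation is an assumption made for illustration: each point of the larger retina is given a rule that is either a constant, a copy of one point of R, or the negation of one point of R. The names `apply_F` and `pulled_back_support` are hypothetical.

```python
def apply_F(rules, X):
    """Build the image figure from X.
    rules maps each point of the larger retina to ('const', bool),
    ('copy', x) or ('neg', x), where x is a point of R."""
    image = set()
    for point, (kind, arg) in rules.items():
        if kind == 'const':
            inside = arg
        elif kind == 'copy':
            inside = arg in X
        else:  # 'neg'
            inside = arg not in X
        if inside:
            image.add(point)
    return image

def pulled_back_support(rules, support_tilde):
    """Points of R on which a mask can depend after composition with F."""
    return {arg for p in support_tilde
            for kind, arg in [rules[p]] if kind != 'const'}

rules = {'a': ('const', True), 'b': ('copy', 1),
         'c': ('neg', 1), 'd': ('copy', 2)}
```

Because every point of the larger retina contributes at most one point of R, the pulled-back support is never larger than the original support, which is the content of the inequality |S(φ_i)| ≤ K.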

Example: A typical application of this construction is illustrated 
as follows (see Figure 5.3). The set R has three points (x_1, x_2, x_3).



Figure 5.3 




The set R̃ has 45 points. In the diagram, these fall into three
classes: 8 points shown as white, 25 points shown as black, and
12 points labeled x_i or x̄_i. F is defined in the following way:
Given a set X in R, F(X) must contain all the black squares,
no white squares, the squares labeled x_i only if x_i ∈ X, and the
squares labeled x̄_i only if x_i ∉ X.

5.5 Huffman’s Construction for ψ_CONNECTED

We shall illustrate the application of the preceding concept by 
giving an alternative proof that ψ_CONNECTED has no finite order,
based on a construction suggested to us by D. Huffman. 

The intuitive idea is to construct a switching network that will be 
connected if an odd number of its n switches are in the “on” 
position. Thus the connectedness problem is reduced to the parity 
problem. Such a network is shown in Figure 5.3 for n = 3. 
The interpretation of the symbols x_i and x̄_i is as follows: when x_i
is in the “on” position contact is made wherever x_i appears, and
broken wherever x̄_i appears; when x_i is in the “off” position
contact is made where x̄_i appears and broken where x_i appears. It
is easy to see that the whole net is connected in the electrical and
topological sense if the number of switches in the “on” position is
1 or 3. The generalization to n is obvious:

1. List the terms in the conjunctive normal form for ψ_PARITY
considered as a point function, which in the present case can be
written

$$(x_1 \lor x_2 \lor x_3) \land (x_1 \lor \bar{x}_2 \lor \bar{x}_3) \land (\bar{x}_1 \lor x_2 \lor \bar{x}_3) \land (\bar{x}_1 \lor \bar{x}_2 \lor x_3)$$


2. Translate this Boolean expression into a switching net by inter- 
preting conjunction as series coupling and disjunction as parallel 
coupling. 

3. Construct a perceptron which “looks at” the position of the 
switches. 
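
Step 1 of the list above can be sketched as follows. This is an illustration under stated assumptions: a literal is encoded as a pair (index, positive), the helper names are hypothetical, and "parity" means an odd number of switches on, as in the n = 3 case where the net is connected for 1 or 3 switches on.

```python
from itertools import product

def parity_cnf(n):
    """Conjunctive normal form for odd parity: one clause for each
    even-parity assignment, false exactly on that assignment."""
    clauses = []
    for bits in product([0, 1], repeat=n):
        if sum(bits) % 2 == 0:
            # Use x_i where the assignment has 0, and its negation where it has 1.
            clauses.append([(i, bits[i] == 0) for i in range(n)])
    return clauses

def evaluate(clauses, bits):
    """Conjunction (series coupling) of disjunctions (parallel coupling)."""
    return all(any((bits[i] == 1) == positive for i, positive in clause)
               for clause in clauses)
```

For n = 3 this gives 4 clauses of 3 literals each, matching the 12 switch squares of the example in §5.4; in general there are 2^(n-1) clauses, which is why the retina of the construction grows exponentially with n.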

The reduction argument in intuitive form is as follows: the Huffman
switching net can be regarded as defining a class F of geometric figures
which are connected or not depending on the parity of a certain set, the
set of switches that are in the “on” position. We thus see how a perceptron
for ψ_CONNECTED on one set R̃ can be used as a perceptron for ψ_PARITY on
a second set R. As a perceptron for ψ_PARITY, it must be of order at least
|R|. Thus the order of ψ_CONNECTED must be at least of order |R|. We can
use the collapsing theorem §5.4.1 to formalize this argument. But before
doing so note that a certain price will be paid for its intuitive simplicity:
the set R̃ is much bigger than the set R; in fact |R̃| must be of the order
of magnitude of 2^|R|, so that the best result to be obtained from the
construction is that the order of ψ_CONNECTED must increase as log |R̃|.
This gives a weaker lower bound than was found in §5.1: log |R̃|
compared with |R̃|^(1/3).

To apply the collapsing theorem we simply define the space R to
be the three-point space R described at the end of §5.4, with R̃ and
F as given there. Then ψ_PARITY on R is equal to ψ_CONNECTED on R̃ for
those figures obtained by applying F to figures on R. The collapsing
theorem states that the order of ψ_PARITY is ≤ the order of ψ_CONNECTED.


5.6 Connectivity on a Toroidal Space | R | 

Our earliest attempts to prove that connectedness has unbounded 
order led to the following curious result: 

Theorem 5.6: The predicate ψ_CONNECTED on a 2n × 6 toroidally
connected space has order ≥ n.

The proof is by construction: consider the space 





in which the edges e, e' and f, f' are considered to be identical
(see also Figure 2.5). Now consider the family F of subsets of R
that satisfy the conditions

1. All the shaded points belong to each X ∈ F,

2. For each X ∈ F and each i, either both points marked a_i or both
points marked b_i are in X, but no other combinations are allowed.

Then it can be seen, for each X ∈ F, that X has either one
connected component or X divides into two separate connected
figures. Which case actually occurs depends only on the parity of
the number of a_i's in X. Then using the collapsing theorem and
Theorem 3.1.1, we find that ψ_CONNECTED has order ≥ n.

The idea for Theorem 5.6 came from the attempt to reduce connectivity
to parity directly by representing the switching diagram shown in Figure
5.4. If an even number of switches are in the “down” position then x is
connected to x' and y to y'. If the number of down switches is odd, x is
connected to y' and x' to y. This diagram can be drawn in the plane by
bringing the vertical connections around the end (see Figure 5.11); then
one finds that the predicate ⌈x is connected to x'⌉ has for order some
constant multiple of |R|^(1/2). If we put the toroidal topology on R, the
order is known (§5.6) to be greater than a constant fraction of |R|; this
is also true for a 3-dimensional Euclidean R. These facts strongly suggest
that our bound for the order of ψ_CONNECTED is too low for the plane case.

5.7 A Better Bound for ψ_CONNECTED in the Plane

The following construction shows that the order of ψ_CONNECTED is
≥ const · |R|^(1/2) for two-dimensional figures. It results from
modifying Figure 5.4 so as to connect x to x'. This is easy for
the torus, but for a long time we thought it was impossible in the
plane.

We first define a “4-switch” to be the pair of figures 



Figure 5.5 





In the down state, one can see that

$$p_i \text{ is connected to } q_{(i+1)_4},$$

where (j)_4 is the remainder when j is divided by 4. In the up
state, we have

$$p_i \text{ is connected to } q_{(i-1)_4}.$$

Now consider the effect of cascading n such switches, as shown in 
Figure 5.6. 



This simply iterates the effect: in fact, if d switches are down and
u switches are up, we have

$$p_i \text{ is connected to } q_{(i+d-u)_4}$$

for all i. Now, since every switch is either up or down,
d + u = n, hence

$$q_{(i+d-u)_4} = q_{(i+2d-n)_4},$$

and we notice that this depends only upon the parity of d.

Next, we add fixed connections that tie together the terminals
q_{(1-n)_4}, q_{(2-n)_4}, and q_{(3-n)_4}.

Then if d is even, p_1, p_2, p_3 are tied together
while if d is odd, p_3, p_0, p_1 are tied together.




In each case p_1 and p_3 are connected, so we can ignore, say, p_3.
So the connectivity of the system has just two states, depending
on the parity of the number of switches in down position, and
these states can be represented as shown below.



Figure 5.7 


To prove our theorem we simply tie p_1 and p_2 together.
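
The wiring argument of this section can be simulated abstractly. The sketch below is a model of the terminal connections only, not of the planar layout: a down switch shifts each line index by +1 and an up switch by -1 (mod 4), the three q terminals named in the text are tied together, and finally p_1 is tied to p_2. The function name `pieces` and the union-find bookkeeping are assumptions made for illustration.

```python
from itertools import product

def pieces(states):
    """Number of connected pieces of the wiring.
    states[k] is True when switch k is in the down position."""
    n = len(states)
    parent = list(range(8))  # 0..3 stand for p_0..p_3, 4..7 for q_0..q_3

    def find(a):
        while parent[a] != a:
            a = parent[a]
        return a

    def union(a, b):
        parent[find(a)] = find(b)

    for i in range(4):
        j = i
        for down in states:  # each switch shifts the line index by +1 or -1
            j = (j + 1) % 4 if down else (j - 1) % 4
        union(i, 4 + j)      # p_i ends up wired to q_j

    tied = [(k - n) % 4 for k in (1, 2, 3)]
    union(4 + tied[0], 4 + tied[1])  # fixed ties among three q terminals
    union(4 + tied[1], 4 + tied[2])
    union(1, 2)                      # the final tie between p_1 and p_2
    return len({find(a) for a in range(8)})
```

In this model the wiring forms a single piece exactly when an odd number of switches are down, and two pieces otherwise.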



Figure 5.8 


It remains only to realize the details of the “4-switches.” Figure 
5.9 illustrates the two configurations. 



Figure 5.9 





Remember that corner contact is not a connection. When the entire
construction is completed for n switches, the network will be about 5n
squares long and about 2n + 12 squares high, so that the number
of switches can grow proportionally to |R|^(1/2). It follows that the
order of ψ_CONNECTED grows at least as fast as |R|^(1/2). Figure 5.10
illustrates the complete construction.
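
The square-root growth follows from the approximate dimensions just quoted. The calculation below takes those dimensions at face value; the constant 70 is only one convenient choice and does not come from the text.

```latex
|R| \approx 5n\,(2n + 12) = 10n^{2} + 60n \;\le\; 70\,n^{2} \quad (n \ge 1)
\qquad\Longrightarrow\qquad
n \;\ge\; \left(\frac{|R|}{70}\right)^{1/2}.
```

Since the construction computes parity on n switches, and parity on n points has order n, the order of ψ_CONNECTED is bounded below by a constant multiple of |R|^(1/2).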



One must verify that there remain no “stray” connection lines
that are not attached eventually to p_0, p_1, or p_2. This can be
verified by inspection of Figure 5.6. Furthermore, no closed loops
are formed, other than the one indicated in the left-hand part of
Figure 5.8.



Figure 5.11





The idea for Theorem 5.7 comes from observing that in the planar
version of Figure 5.4 (see Figure 5.11) we have p_1 ↔ q_1 and p_2 ↔ q_2 for
one parity and p_1 ↔ q_2 and p_2 ↔ q_1 for the other. If we could make a
permanent additional direct connection from p_1 to q_1 then the whole net
would be connected or disconnected according to the parity. But this is
topologically impossible, and because the construction appeared
incompleteable we took the long route through proving and applying the
one-in-a-box theorem. Only later did we realize that the p_1 ↔ q_1
connection could be made “dynamically,” if not directly, by the
construction in Figure 5.10.


5.7.2 The Order of ψ_CONNECTED as a Function of |R|

What is the true order? Let us recall that at the root of the proof
methods we used was the device (§5.0) of considering not all the
figures but only special subclasses with special combinatorial
features. Thus even the order of §5.6 is only a lower bound. Our
suspicion is that the order cannot be less than a fixed fraction of |R|.
As for the number of φ's required, Theorem 3.1.2 and the toroidal
results give us ≥ 2^(|R|/12), but this, too, is only a lower bound, and
one suspects that nearly all the masks are needed. Another line of
thought suggests that one might get by with an order like the
logarithm of the number of connected figures, but that has
probably not much smaller an exponent.

Examination of the toroidal construction in §5.6 might make one
suspect that the result, order of ψ_CONNECTED ≥ (1/12)|R|, is an
artifact resulting from the use of a long, thin torus. Indeed, for a
“square” torus we could not get this result because of the area that
would be covered by the connecting bridge lines. This clouds the
conclusion a little. On the other hand, if we consider a
three-dimensional R, then there is absolutely no difficulty in showing
that the order of ψ_CONNECTED ≥ (1/K)|R|, for some moderate value of K.
It is hard to believe that the difference in dimension could matter
very much.


5.8 Topological Predicates 

We have seen that ⌈X is connected⌉ is not of finite order and we
shall see soon that ⌈X contains a hole⌉ is also not of finite order.
Curiously enough the predicate

⌈X is connected or X contains a hole⌉

has finite order, even though neither disjunct does: an instance
of the opposite of the And/Or phenomenon. This will be shown
by a construction involving the Euler relation for orientable
geometric figures.

5.8.1 The Euler Polygon Formula 

Two-dimensional figures have a topological invariant* that in 
polygonal cases is given by 

$$B(X) = |\text{faces}(X)| - |\text{edges}(X)| + |\text{vertices}(X)|.$$

The sums under the examples in Figure 5.12 illustrate this formula 
by counting the number of faces, edges, and vertices, respectively. 
Use of the formula presupposes that a figure is cut into sufficiently 
small pieces that each “face” be simple — that is, contain no holes. 

It is a remarkable fact that B(X) will have the same value for any 
dissection of X that meets this condition. 


[Figure 5.12: example figures with G = 1, G = 0, and G = -1, each annotated with its count of faces, edges, and vertices]


In our context of figures made up of checkerboard squares, B(X) 
can be computed by a low-order linear sum G(X) defined as 
follows: 

$$G(X) = \sum \alpha_i x_i + \sum \alpha_{ij} x_i x_j + \sum \alpha_{ijkl} x_i x_j x_k x_l,$$ where


*For our purposes here, a “topological invariant” is nothing more than a predi- 
cate that is unchanged when the figure is distorted without changing connected- 
ness or inside-outside relations of its parts. 





α_i = 1 for each point x_i of R (vertices),
α_ij = -1 for each adjacent pair x_i, x_j (edges),
α_ijkl = 1 for each 2 × 2 square of four points x_i, x_j, x_k, x_l (faces).


G(X) and B(X) agree exactly on checkerboard figures without
corner contacts (two squares touching only at a vertex). When they
disagree, as they can where such contacts occur, our definition of
connectedness requires the value of G(X).

The importance of G(X) in our theory lies in the fact that al- 
though it is highly local — in fact, diameter limited and finite 
order — it is equivalent to the global formula 


$$E(X) = |\text{components}(X)| - |\text{holes}(X)|.$$


A component of a figure is the set of all points connected to a 
given point. 

A hole of a figure is a component of the complement of a figure. 
We assume that a figure is surrounded by an “outside” that does 
not count as a hole. Also, we have to define “corner contact” to 
be a connection, when dealing with a figure’s complement. 
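
The two formulas can be compared numerically before reading the proof. The sketch below is an illustration under stated assumptions: a figure is a set of (row, column) squares inside a small window, components of the figure use edge adjacency, components of the complement use edge-or-corner adjacency as just described, and the window is padded by one square so that the outside is a single piece. All function names are hypothetical.

```python
def g_local(cells):
    """G(X): points, minus edge-adjacent pairs, plus full 2 x 2 blocks."""
    points = len(cells)
    pairs = sum(((r, c + 1) in cells) + ((r + 1, c) in cells) for r, c in cells)
    blocks = sum((r, c + 1) in cells and (r + 1, c) in cells
                 and (r + 1, c + 1) in cells for r, c in cells)
    return points - pairs + blocks

def count_components(cells, neighbours):
    """Number of connected pieces of cells under the given adjacency."""
    seen, count = set(), 0
    for start in cells:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:
            r, c = stack.pop()
            for dr, dc in neighbours:
                nxt = (r + dr, c + dc)
                if nxt in cells and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return count

EDGE = [(0, 1), (1, 0), (0, -1), (-1, 0)]
EDGE_OR_CORNER = EDGE + [(1, 1), (1, -1), (-1, 1), (-1, -1)]

def e_global(cells, height, width):
    """E(X) = components - holes, for a figure inside a height x width window."""
    window = {(r, c) for r in range(-1, height + 1) for c in range(-1, width + 1)}
    complement = window - cells
    holes = count_components(complement, EDGE_OR_CORNER) - 1  # drop the outside
    return count_components(cells, EDGE) - holes
```

For a 3 × 3 ring with the centre missing, both quantities are 0: eight points, eight adjacent pairs, and no full 2 × 2 block on the local side; one component and one hole on the global side.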

Now we will prove that the local formula G(X) and the global 
formula E(X) are equivalent. First we will give a rather direct 
demonstration. Then in §5.8.2 we will give another kind of proof, 
based on deforming one figure into another, that will give a better 
insight into the proof of the main theorem of §5.9. 

Any figure X can be obtained by beginning with a one-square 
figure and adding squares one at a time. For a single square we 
have 

G(X) = E(X) = 1. 

Adding a square that is not adjacent to any square already in X 
adds 1 to G(X), and (since it is a new component!) adds 1 to
E(X). 

Adding a square adjacent to exactly one other square cannot 
change E(X), and adds exactly 0 - 1 + 1 = 0 to G(X). 





Three kinds of things can happen when one adds a square adja- 
cent to two others. When the new square fills in a corner, as in 



then 1 - 2 + 1 = 0 is added to G, leaving it unchanged, and
neither is E(X) changed in this case. But when the new square 
connects two others that were not already connected together, 
as in 



then there is a net decrease of 1 - 2 + 0 = -1 in G together
with a decrease in E(X), because we have joined two previously 
separated parts. Finally, if the added square connects two squares 
that are already connected by some remote path, as in, 



then a region of space is cut off and a hole is formed, decreasing
E by 1, and again the change in G is 1 - 2 + 0 = -1. Case
analyses of the 3-neighbor and 4-neighbor situations complete 
the proof: these include partial fills like 




which add 1 - 3 + 2 = 0 and 1 - 4 + 4 = 1. Notice that in the 
latter case, G is increased by one unit, as the hole is finally
filled in. In each case either G was unchanged, or the topology
of the figure X was changed. (All this corresponds to an argument 
in algebraic topology concerning addition of edges and cells to 
chain-complexes.) This proves 

Theorem 5.8.1: E(X) = G(X).

It follows immediately that the predicate ⌈G(X) < n⌉ is realized
with order ≤ 4. This leads to some curious observations: If we are
given that the figures X are restricted to the connected
(= one-component) figures then an order-4 machine can recognize

⌈X has no holes⌉ = ⌈G(X) > 0⌉

or

⌈X has less than 3 holes⌉ = ⌈G(X) > -2⌉.

Given that there are no holes, we can recognize ψ_CONNECTED.

But of course we cannot conclude that these can be recognized 
unconditionally by a finite-order perceptron. 

Note that this topological invariant is thus seen to be highly
“local” in nature; indeed all the φ's satisfy a very tight
diameter-limitation! Now returning to our initial claim we note that

$$\lceil G(X) = n \rceil \equiv \big(\lceil G(X) \le n \rceil \land \lceil G(X) \ge n \rceil\big).$$

By Theorem 1.5.4 we can conclude that ⌈G(X) = n⌉ has order ≤ 8.
But the proof of that theorem involves constructing product-φ's
that are not diameter-limited, and we show in §8.4 that this
predicate cannot be realized by diameter-limited perceptrons.

5.8.2 Deformation of Figures into Standard Forms 

The proof of Theorem 5.8.1 shows that $G(X)$ will take the same 
values on any two figures $X$ and $Y$ that have the same value of 
$E = |\text{components}| - |\text{holes}|$. Now we will show that one can 
make a sequence of figures $X, \dots, X_i, \dots, Y$, each differing from 
its predecessor in one locality, and all having the same values for 
$G = E$. It is easy to see how to deform figures “smoothly” without 
changing $G$ or $E$; in fact, without changing holes or components. 
For example, the sequence 



can be used to enlarge a hole. Now we observe that if a component 
$C_0$ lies within a hole $H_1$ of another component $C_1$, then 
$C_0$ can be moved to the outside without changing $E(X)$ or $G(X)$. 
Suppose, for simplicity, that $C_1$ touches the “outside” and that 
$C_0$ is “simply” in $H_1$; that is, there is no component $C$ also 
enclosing $C_0$. 



Then $C_0$ can be removed from $H_1$ by a series of deformations in 
which, first, $H_1$ is drawn to the periphery 



and then $C_0$ is temporarily attached: 



Notice that this does not change the value of $G(X)$. Also, since it 
reduces both $C$ and $H$ by unity, it does not change $E(X) = 
C(X) - H(X)$. 

We can then swing $C_1$ around to the outside 

and reconnect to obtain 



again without changing G(X) or E(X). Clearly, we can eventually 
clear out all holes, by repeating this on each component as it 
comes to the outside. When this is done, we will have some num- 
ber of components, each of which may have some empty holes, 
and they can be all deformed into standard types of figures like 



Now by reversing the previous operation that took us from step 6 
to 7, we can fuse any component that has a hole with any other 
component, for example, 



and thus one can reduce simultaneously both $C$ and $H$ until $H$ is 
zero or $C$ is unity. At this point one has either 

[Figure: a row of n solid components] 

or 

[Figure: a single component with m holes] 

In each case one can verify that 

$$G(X) = E(X) = n$$

or 

$$G(X) = E(X) = 1 - m.$$

We will apply this well-known result in the next section. 
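The two standard forms, and the claim that a component can be moved out of a hole without changing G, can be checked on concrete lattice figures. This is a sketch under the same assumption as before about the form of G(X) (squares minus edge-adjacent pairs plus filled 2-by-2 blocks); the helper figures `frame`, `solid_components` and `component_with_holes` are hypothetical constructions, not the drawings of the original.

```python
def g_number(cells):
    """G(X) = squares - edge-adjacent pairs + filled 2x2 blocks (assumed form)."""
    cells = set(cells)
    pairs = sum(((x + 1, y) in cells) + ((x, y + 1) in cells) for x, y in cells)
    blocks = sum(
        (x + 1, y) in cells and (x, y + 1) in cells and (x + 1, y + 1) in cells
        for x, y in cells
    )
    return len(cells) - pairs + blocks


def frame(size):
    """Outline of a size-by-size square: one component enclosing one hole."""
    return {
        (x, y)
        for x in range(size)
        for y in range(size)
        if x in (0, size - 1) or y in (0, size - 1)
    }


def solid_components(n):
    """Standard form with n isolated solid squares."""
    return {(2 * i, 0) for i in range(n)}


def component_with_holes(m):
    """Standard form with a single component pierced by m one-cell holes."""
    slab = {(x, y) for x in range(2 * m + 1) for y in range(3)}
    return slab - {(2 * i + 1, 1) for i in range(m)}


# The same small component placed inside the hole of a frame, then outside it.
inside = frame(5) | {(2, 2)}
outside = frame(5) | {(7, 2)}
print(g_number(inside), g_number(outside))
print([g_number(solid_components(n)) for n in range(1, 5)])
print([g_number(component_with_holes(m)) for m in range(1, 5)])
```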

5.9 Topological Limitations of Perceptrons 

Theorem 5.9: The only topologically invariant predicates of finite 
order are functions of the Euler number $E(X)$. 

The authors had proved the corresponding theorem for the diameter-limited 
perceptron, and conjectured that it held also for the order-limited 
case but were unable to prove it. It was finally established by Michael 
Paterson, and §5.9.1 is based upon his idea. 

5.9.1 Filling Holes 

Suppose that $C(X) \ge 2$ and $H(X) \ge 1$. 

Choose a hole $H_0$ in a component $C_0$. Let $C_1$ be a component 
“accessible” to $C_0$, that is, there is a path $P_{01}$ from a boundary 
point of $C_0$ to a boundary point of $C_1$ that does not touch $X$. 
Let $P_{00}$ be a path within $C_0$ from a point on the boundary of hole 
$H_0$ to a point on another boundary of $C_0$, such that $P_{00}$ and $P_{01}$ 
are connected. 



This is always possible even if $C_1$ is within $H_0$, outside $C_0$ 
completely, or within some other hole in $C_0$. 

Now, if $\psi(X)$ is a topologically invariant predicate, its value will 
not be changed when we deform the configuration in the following 
way: 



Finally suppose that we were permitted to change the connections 
in the box, to 



In effect, this would cut along $P_{00}$, removing a hole, and connect 
across one side of $P_{01}$, reducing by unity the number of components. 
Thus it would leave $E(X)$ unchanged. 



Now we will show that making this change cannot affect the value 
of $\psi$! 

Suppose that $\psi$ has order $k$. Deform the figure $X$ until the box 
contains a cascade of $k + 1$ “4-switches.” (See Figure 5.6.) This 
does not change the topology, so it leaves $\psi$ unchanged. Now 
consider the $2^{k+1}$ variants of $X$ obtained by the $2^{k+1}$ states of the 
cascade switch. If $\psi$ has the same value for all of these, then we 
obviously can make the change, trivially, without affecting $\psi$. If 
two of these give different values, say $\psi(X') \ne \psi(X'')$, then these 
must correspond to different parities of the switches, because $\psi$ is 
a topological invariant. But if this is so, then $\psi$ must be able to 
tell the parity of the switch, since all $X$'s of a given parity class 
are topological equivalents (see details in §5.7). But because of 
the collapsing theorem, we know that this cannot be: $\psi$ must 
become “confused” on a parity problem of order $> k$. Therefore 
all figures obtained by changing the switches give the same value 
for $\psi$, so we can apply the transformations described in §5.8.2 
without changing the values of $\psi$. 

5.9.2 The Canonical Form 

We can use the method of §5.9.1 and §5.8.2 to convert an arbitrary 
figure $X$ to a canonical form that depends only upon $E(X)$ 
as follows: we simply repeat it until we run out of holes or components. 
At this point there must remain either 

1. A single component with one or more holes, or 

2. One or more simple solid components 

according to whether or not $E(X) \le 0$. 

In case 1 the final figure is topologically equivalent to a figure like 

[Figure: a single component with m holes] 

with $1 - E(X)$ holes, while in case 2 the final figure is equivalent 
to one like 

[Figure: a row of solid components] 

with $E(X)$ solid squares. Clearly, then, any two figures $X$ and $X'$ 
that have $E(X) = E(X')$ must also have $\psi(X) = \psi(X')$. This completes 
the proof of Theorem 5.9, which says that $\psi(X)$ must 
depend only upon $E(X)$. 

remark: There is one exception to Theorem 5.9 because the canonical 
form does not include the case of the all-blank picture! For the predicate 

$$\lceil X \text{ is nonempty} \rceil$$

is a topological invariant but is not a function of $E(X)$! See §8.1.1 
and §8.4. 

There are many other topological invariants besides the number of components 
of $X$, and $G(X)$, for example, 

$$\lceil \text{a component of } X \text{ lies within a hole within another component of } X \rceil.$$

The theorem thus includes the corollary that no finite-order predicate 
can distinguish between figures that contain others (left below) and those 
(right below) that don’t. 


Problem: What are the extensions of this analysis to topological 
predicates in higher dimensions? Is there an interpretation of 
$\sum \alpha_\varphi \varphi$ as a cochain on a simplicial complex, in which the 
threshold operation has some respectably useful meaning? 





6 Geometric Patterns of Small Order: Spectra and Context 


6.0 Introduction to Chapters 6 and 7 

In Chapters 6 and 7 we study predicates that are more strictly 
geometric than is connectedness. An example typical of the prob- 
lems discussed is to recognize all translations of a particular figure 
or class of figures. In one sense the results are more positive than 
those of the previous chapter. Many such problems can be solved 
using low-order perceptrons, and the two chapters will be 
organized around two techniques for constructing geometrical 
predicates whose orders are often surprisingly small. 

The technical content of this introduction will be partly incom- 
prehensible until after Chapter 7 is read. It is intended, if read in 
the proper spirit, to provide an atmosphere enveloping this series 
of results and observations with a certain coherence. 

Whenever we can apply the group invariance theorem, the study 
of invariant predicates of small order reduces to the study of a few 
kinds of elementary local predicates. The bigger the group, the 
smaller and simpler becomes this set of elementary predicates. 
Because $\psi_{\text{PARITY}}$ is invariant under the biggest possible group 
(namely, all permutations) we were able to use for the elementary 
predicates the simple masks, classified merely according to the 
sizes of their support sets. Interesting geometric predicates will 
not survive such drastic transformations. Groups such as translations 
or general rigid motions lead to more numerous equivalence-types 
of partial predicates. Figures satisfying invariant 
predicates will nevertheless be characterized entirely by the sets of 
numbers which tell us how many of each type of partial predicate 
they satisfy. We shall call these sets spectra and show in Chapter 6 
how to use them. 

Chapter 7 will center around a very different technique for con- 
structing geometric predicates. Whenever the group can be ordered 
in an appropriate way we can stratify the set of figures equivalent, 
under the group, to a given one, by the rank order of the group 
element necessary to effect the transformation. We can thus (in 
many interesting cases) split the recognition problem into two 
parts: recognize the stratum to which the figure belongs and then 
apply a simple test appropriate to the stratum. This description 
has an air of serial rather than parallel computation and, indeed, 
part of its interest is that it shows at least one way of simulating 
a serial, or conditional, operation using a parallel procedure. 



Naturally a price has to be paid for this simulation. Our method 
of achieving it leads to extremely large coefficients in the linear 
representations obtained. Taken in itself this does not exclude the 
possibility of some other procedure achieving the same result 
more cheaply. We are therefore led (in Chapter 10) to a new area 
of study — the bounds on coefficient sizes — and to some in- 
triguing, though as yet only partially understood, results. 

We recall that our proof of the group-invariance theorem as- 
sumed that the group was finite. The ordering we use in strati- 
fication assumes that the group is infinite: for example, the 
translations on the infinite plane are ordered in the obvious way, 
but this becomes impossible if the group is made finitely cyclic by 
the toroidal construction described in §5.6. When we first ran into 
this conflict, the techniques of stratification and those related to 
group invariance (spectra, etc.) seemed to be strictly disjoint areas 
of research. But further study brought them together in a possibly 
deep way. We can in fact rescue the group-invariance theorem in 
some infinite cases by assuming that the coefficients are bounded. 
For example, suppose that $\psi(X)$ is a predicate defined for finite 
figures, $X$, on the infinite plane and is invariant under the group of 
translations. Then it can be expressed as an infinite linear form, 
for example, 

$$\psi(X) = \Bigl\lceil \sum_{\varphi \in \Phi} \alpha_\varphi \varphi(X) > \theta \Bigr\rceil$$

where $\Phi$ is an infinite set (for example, the masks) chosen so that 
for any finite $X$ all but a finite number of terms in the sum will 
vanish. Now, if we know that the $\alpha_\varphi$ are bounded we can use 
(by Theorem 10.4.1) the group-invariance theorem. In some particular 
cases this yields an order greater than that obtained by 
stratification. The contradiction can be dissipated only by concluding 
that the coefficients $\alpha_\varphi$ cannot be bounded for any low-order 
representative. It follows that the largeness of the coefficients 
produced by our stratification procedure is not merely an 
accidental result of an inept algorithm (though, of course, the 
actual values might be; we have not shown that they are minimal). 

We were, of course, delighted to find that what seemed at first 
to be a limitation of our favorite theorem could be used to yield 
a valuable result. But our feeling that the situation is deeply 
interesting goes beyond this immediate (and practical) problem of 
sizes of coefficients. It really comes from the intrusion of the 
global structure of the transformation group. For a long time we 
believed that the recognition of all translations of a given figure 
was a high-order problem. Stratification showed we were wrong. 
But we have not been able to find low-order predicates for the 
corresponding problem when the group contains large finite cyclic 
subgroups such as rotations or translations on the torus, and 
we continue to entertain the conjecture that these are not finite- 
order problems. 

Complementing the positive results of Chapter 6 will be found 
one negative theorem of considerable practical interest. This con- 
cerns the recognition of figures in context. It is easy to decide, 
using a low-order predicate, whether a given figure is, say, a 
rectangle. The new kind of problem is to decide whether the figure 
contains a rectangle and, perhaps, something else as well (see 
Figure 6.1). 



It seems obvious that recognition-in-context should be somewhat 
harder, perhaps requiring a somewhat higher order. We shall 
show (§6.6) that it is worse than that: it is not even of finite 
order! 

Finally it should be noted that we manage, once more, to avoid 
the need to use a tolerance theory to escape from the limitations 
of our square-grid arrays. The translation group does not raise 
this problem. The rotation group does; but we say all we have to 
say in the context of 90° rotations. The similarity group suggests 
the most serious difficulties: one can dilate a figure easily enough, 
but how can one contract a small one? As it happens we have 
nothing interesting to say about this group. We urge future 
workers to be less cowardly. 




In §6.1 to §6.4 we begin by showing that certain patterns have 
orders $= 1$, $= 2$, $\le 3$, $\le 4$, respectively. In most cases we 
have not established the lower bound on the orders and have no 
systematic methods for doing so. 

6.1 Geometric Patterns of Order 1 

When we say “geometric property” we mean something invariant 
under translation, usually also invariant under rotation, and often 
invariant under dilation. The first two invariances combine to 
define the “congruence” group of transformations, and all three 
treat alike the figures that are “similar” in Euclidean geometry. 
For order 1 we know that coefficients can be assumed to be 
equal.* Therefore, the only patterns that can be of order 1 are 
those defined by a single cut in the cardinality or area of the set: 

$$\psi = \lceil |X| > A \rceil \quad \text{or} \quad \psi = \lceil |X| < A \rceil.$$

Note: If translation invariance is not required, then perceptrons of 
order 1 can, of course, compute other properties, for example, 
concerning moments about particular points or axes. (See §2.4.1.) 
However, these are not “geometric” in the sense of being suitably 
invariant. So while they may be of considerable practical importance, 
we will not discuss them further.† 

6.2 Patterns of Order 2, Distance Spectra 

For $k = 2$ things are more complicated. As shown in §1.4, Example 3, 
it is possible to define a double cut or segment 
$\lceil A_1 < A < A_2 \rceil$ in the area of the set and recognize the figures 
whose areas satisfy 

$$\psi = \lceil A_1 < |X| < A_2 \rceil.$$
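One way to see why the double cut needs only order 2 (the construction in §1.4 may differ in detail) is to note that $A_1 < |X| < A_2$ is the same as $(|X| - A_1)(A_2 - |X|) > 0$, and that $|X|^2 = \sum_i x_i + 2\sum_{i<j} x_i x_j$ because $x_i^2 = x_i$. The sketch below evaluates the resulting form; the function name and the representation of a figure as a tuple of 0/1 values are assumptions made for illustration.

```python
from itertools import combinations


def in_interval_order2(x, a1, a2):
    """Evaluate (|X| - a1)(a2 - |X|) > 0 using masks of support at most 2.

    `x` is a sequence of 0/1 values, one per retinal point.
    """
    singles = sum(x)  # sum of x_i, the order-1 masks
    pairs = sum(xi * xj for xi, xj in combinations(x, 2))  # sum of x_i x_j, i < j
    # -|X|^2 + (a1 + a2)|X| - a1*a2, with |X|^2 expanded as singles + 2*pairs
    form = (a1 + a2 - 1) * singles - 2 * pairs - a1 * a2
    return form > 0


print(in_interval_order2((1, 1, 0, 0, 0, 0), 1, 4))
print(in_interval_order2((1, 1, 1, 1, 0, 0), 1, 4))
```

The coefficients depend only on the size of each mask's support, as one expects for a predicate invariant under all permutations.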

* All the theorems of this chapter assume that the group-invariance theorem can 
be applied, even though the translation group is not finite. This can be shown to 
be true if (Theorem 10.4) the coefficients are bounded; it can be shown in any case 
for order 1, and there are all sorts of other conditions that can be sufficient. 
In §7.10 we see that the group-invariance theorem is not always available. We 
do not have a good general test for its applicability. Of course, the coefficients 
will be bounded in any physical machine! 

† See, for example, Pitts and McCulloch [1947] for an eye-centering servomechanism 
using an essentially order-1 predicate. 

In fact, in general we can always find a function of order $2k$ that 
recognizes the sets whose areas lie in any of $k$ intervals. But let us 
return to patterns with geometric significance. First, consider only 
the group of translations, and masks of order 2. Then two masks 
$x_1 x_2$ and $x_1' x_2'$ are equivalent if and only if the difference vectors 

$$x_1 - x_2 \quad \text{and} \quad x_1' - x_2'$$

are equal, with same or opposite sign. Thus, with respect to the 
translation group, any order-2 predicate can depend only on a 
figure’s “difference-vector spectrum,” defined as the sequence of 
numbers of pairs of points separated by each difference vector. 
For example, the two figures 

[Figure: two different figures] 

have the same difference-vector spectra, that is, 

[Table: each “vector” against its “number of pairs”] 

Hence no order-2 predicate can make a classification which is 
both translation invariant and separates these two figures. In fact, 
an immediate consequence of the group-invariance theorem is: 

Theorem 6.2: Let $\psi(X)$ be a translation-invariant predicate of 
order 2. Define $n_v(X)$ to be the number of pairs of points in $X$ that 
are separated by the vector $v$. Then $\psi(X)$ can be written 

$$\psi(X) = \Bigl\lceil \sum_v \alpha_v n_v(X) > \theta \Bigr\rceil.$$

proof: Exactly $n_v(X)$ of the masks in the equivalence class of $v$ are 
satisfied by any translation of $X$. By Theorem 2.3 they can all be 
assigned the same coefficient. 

Corollary: Two figures with the same translation spectrum $n(v)$ 
cannot be distinguished by a translation-invariant order-2 perceptron. 
(But see footnote to §6.1.) 

Conversely if the spectra are different, for example $n_{v_1}(A) < n_{v_1}(B)$, 
then the translations of the two figures can be separated by 
$\lceil n_{v_1}(X) < n_{v_1}(B) \rceil$. But classes made of different figures may not 
be so separable. 
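The difference-vector spectrum of Theorem 6.2 is easy to compute directly. The following is a minimal sketch in which a figure is assumed to be a set of integer lattice points and each difference vector is recorded with a fixed choice of sign; `order2_form` takes a hypothetical coefficient table `alpha`. The two collinear figures `A` and `B` are an illustration chosen here, not the pair drawn in the original: they are different figures whose spectra coincide.

```python
from collections import Counter
from itertools import combinations


def difference_spectrum(figure):
    """Map each difference vector (taken up to sign) to its number of point pairs."""
    spectrum = Counter()
    # Sorting fixes the sign: the second point is always lexicographically larger.
    for (x1, y1), (x2, y2) in combinations(sorted(figure), 2):
        spectrum[(x2 - x1, y2 - y1)] += 1
    return spectrum


def translate(figure, dx, dy):
    return {(x + dx, y + dy) for x, y in figure}


def order2_form(figure, alpha, theta):
    """Evaluate sum_v alpha[v] * n_v(X) > theta for a coefficient table alpha."""
    spectrum = difference_spectrum(figure)
    return sum(alpha.get(v, 0) * n for v, n in spectrum.items()) > theta


# Two different point sets on one row with identical difference-vector spectra.
A = {(x, 0) for x in (0, 1, 4, 10, 12, 17)}
B = {(x, 0) for x in (0, 1, 8, 11, 13, 17)}

print(difference_spectrum(A) == difference_spectrum(translate(A, 5, -3)))
print(difference_spectrum(A) == difference_spectrum(B))
```

Because `order2_form` sees a figure only through its spectrum, no choice of `alpha` and `theta` can give different answers on `A` and `B`.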

Example: the figures 



are indistinguishable by order-2 predicates, while 



have different difference-vector spectra and can be separated. If 
we add the requirement of invariance under rotations, the latter 
pair above becomes indistinguishable, because the equivalence- 
classes now group together all differences of the same length, 
whatever their orientation. 

Note that we did not allow reflections, yet these reflectionally 
opposite figures are now confused! One should be cautious about 
using “intuition” here. The theory of general rotational in- 
variance requires careful attention to the effect of the discrete 
retinal approximation, but could presumably be made consistent 
by a suitable tolerance theory; for the dilation “group,” there 
are serious difficulties. (For the group generated by the 90° rota- 
tions, the example above fails but the following example works.) 
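The remark about reflections can be illustrated with a direction-independent spectrum. In this sketch the spectrum is assumed to count point pairs by squared distance, and the L-shaped figure is a hypothetical example rather than one of the original drawings. A figure and its mirror image then always share a spectrum, even when no translation or 90° rotation carries one onto the other.

```python
from collections import Counter
from itertools import combinations


def distance_spectrum(figure):
    """Count point pairs by squared distance, ignoring direction."""
    return Counter(
        (x2 - x1) ** 2 + (y2 - y1) ** 2
        for (x1, y1), (x2, y2) in combinations(figure, 2)
    )


def normalize(figure):
    """Translate a figure so that its smallest x and y coordinates are zero."""
    min_x = min(x for x, _ in figure)
    min_y = min(y for _, y in figure)
    return frozenset((x - min_x, y - min_y) for x, y in figure)


def rotate90(figure):
    return {(-y, x) for x, y in figure}


def mirror(figure):
    return {(-x, y) for x, y in figure}


def rotation_class(figure):
    """All four 90-degree rotations of a figure, each normalized."""
    forms = set()
    for _ in range(4):
        forms.add(normalize(figure))
        figure = rotate90(figure)
    return forms


L_shape = {(0, 0), (0, 1), (0, 2), (1, 0)}
print(distance_spectrum(L_shape) == distance_spectrum(mirror(L_shape)))
print(normalize(mirror(L_shape)) in rotation_class(L_shape))
```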



An interesting pair of figures rotationally distinct, but nevertheless 
indistinguishable for $k = 2$, is the pair 

[Figure: two five-point figures] 

which have the same (direction-independent) distance-between-point-pair 
spectra through order 2, namely, 

$|x_i - x_j| = 1$ from 4 pairs 
$|x_i - x_j| = \sqrt{2}$ from 2 pairs 
$|x_i - x_j| = 2$ from 2 pairs 
$|x_i - x_j| = \sqrt{5}$ from 2 pairs 

and each has 5 points (the order-1 spectrum). 

The group-invariance theorem, §2.3, tells us that any group- 
invariant perceptron must depend only on a pattern’s “occupancy 
numbers,” that is, exactly the “geometric spectra” discussed here. 
Many other proposals for “pattern-recognition machines” — not 
perceptrons, and accordingly not representable simply as linear 
forms — might also be better understood after exploration of their 
relation to the theory of these geometric spectra. But it seems 
unlikely that this kind of analysis would bring a great deal to the 
study of the more “descriptive” or, as they are sometimes called, 
“syntactic” scene-analysis systems that the authors secretly advo- 
cate. 

Another example of an order-2 predicate is 

$$\lceil X \text{ lies within a row or column and has} < n \text{ segments} \rceil$$

which can be defined by 

[Equation: a threshold form combining sums over the pictured point and pair masks, plus a term for all non-collinear pairs, compared against n] 


6.3 Patterns of Order 3 


6.3.1 Convexity 

A particularly interesting predicate is 

$$\psi_{\text{CONVEX}}(X) = \lceil X \text{ is a single, solid convex figure} \rceil.$$

That this is of order $\le 3$ can be seen from the definition of “convex”: 
$X$ is convex if and only if every line-segment whose end 
points are in $X$ lies entirely within $X$. It follows that $X$ is convex 
if and only if 

$$a \in X \text{ and } b \in X \implies \text{midpoint}([a,b]) \in X;$$

hence 

$$\psi_{\text{CONVEX}}(X) = \Bigl\lceil \sum_{\text{all } a, b \text{ in } X} \lceil \text{midpoint}([a,b]) \text{ not in } X \rceil < 1 \Bigr\rceil$$

has order $\le 3$ and presumably order $= 3$. This is a “conjunctively 
local” condition of the kind discussed in §0.2. 
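The midpoint form can be sketched on a square lattice as follows. This is an illustration under a stated simplification: only pairs whose midpoint is itself a lattice point are tested, so it is a discrete stand-in for the continuous definition and inherits the tolerance problems the text mentions. Each term involves the two end points and the midpoint, so its support has three points.

```python
from itertools import combinations


def midpoint_violations(figure):
    """Count pairs a, b in X whose lattice midpoint is not in X."""
    figure = set(figure)
    count = 0
    for (x1, y1), (x2, y2) in combinations(sorted(figure), 2):
        # Only pairs with a lattice-point midpoint are tested (simplification).
        if (x1 + x2) % 2 == 0 and (y1 + y2) % 2 == 0:
            if ((x1 + x2) // 2, (y1 + y2) // 2) not in figure:
                count += 1
    return count


def psi_convex(figure):
    """Threshold form: accept when the number of violations is less than 1."""
    return midpoint_violations(figure) < 1


solid = {(x, y) for x in range(3) for y in range(3)}
bent = {(0, 0), (1, 0), (2, 0), (0, 1), (0, 2)}  # an L-shaped figure
print(psi_convex(solid), psi_convex(bent))
```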

Note that if a connected figure is not convex one can further con- 
clude that it has at least one “local” concavity, as in 



with the three points arbitrarily close together. Thus, if we are 
given that X is connected, then convexity can be realized as 
diameter-limited and order 3. If we are not sure X is connected, 
then the preceding argument fails in the diameter-limited case 
because a pair of convex figures, widely separated, will be ac- 
cepted: 


Indeed, convexity is probably not order 3 under the additional 
restriction of diameter limitation, but one should not jump to the 
conclusion that it is not diameter-limited of any order, because of 
the following “practical” consideration: 

Even given that a figure is connected, its “convexity” can be defined 
only relative to a precision of tolerance. In addition figures 
must be uniformly bounded in size, or else the small local tolerance 
becomes globally disastrous. But within this constraint, one 
can approximate an estimate of curvature, and define “convex” to 
be $\int(\text{curvature}) < 4\pi$. We will discuss this further in §8.3 and 
§9.9. 


6.3.2 Rectangles 



Figure 6.2 Some “hollow” rectangles. 



Within our square-array formulation, we can define with order 3 
the set of solid axis-parallel rectangles. This can even be done 
with diameter-limited $\varphi$'s, by 

[Equation: a sum over the pictured masks $\varphi$, compared against 4] 

where all $\varphi$'s equivalent under 90° rotation are included. The 
hollow rectangles are caught by 

[Equation: a weighted sum over the pictured masks, compared against a threshold] 

where the coefficients are chosen to exclude the case of two or 
more separate points. These examples are admittedly weakened 
by their dependence on the chosen square lattice, but they have an 
underlying validity in that the figures in question are definable 
in terms of being rectilinear with not more than four corners, and 
we will discuss this slightly more than “conjunctively local” kind 
of definition in Chapter 8. 

One would suppose that the sets of hollow and solid squares 
would have to be of order 4 or higher, because the comparison of 
side-lengths should require at least that. It is surprising, therefore, 
to find they have order 3. The construction is distinctly not con- 
junctively local, and we will postpone it to Chapter 7. 

6.3.3 Higher-order Translation Spectra 

If we define the 3-vector spectrum of a figure to be the set of 
numbers of 3-point masks satisfied in each translation-equivalence 
class, it is interesting to note the following fact (which is 
about geometry, and not about linear separation). 

Theorem 6.3.3: Figures are uniquely characterized (up to translation) 
by their 3-vector spectra, even in higher dimensions. 



Figure 6.3 

The proof shows that the longest vectors that can be inscribed in a 
figure are unique in each direction: no other vector can be parallel 
and equal in length to a longest vector. 

proof: Let $X$ be a particular figure. The figure $X$ has a maximal 
distance $D$ between two of its points. Choose a pair $(a,b)$ of points 
of $X$ with this distance and consider the set $\Phi_{ab} = \{\varphi_{a,b,x}\}$ of masks 
of support 3 that contain $a$, $b$ and any other point $x$ of $X$. Each 
such mask must have coefficient equal to unity in the translation 
spectrum of $X$, for if $X$ contained two translation-equivalent 
masks 

$$\varphi_{a,b,x} \quad \text{and} \quad \varphi_{ga,gb,gx},$$

then one of the distances $[a,gb]$ or $[ga,b]$ would exceed $D$, for they 
are diagonals of a parallelogram with one side equal to $D$ (see 
Figure 6.3). 

Thus any translation of $X$ must contain a unique translation of 
$(a,b)$ and the part of its spectrum corresponding to $\Phi_{ab}$ allows one 
to reconstruct $X$ completely (see Figure 6.4). 

Figure 6.4 

The fact that a figure is determined by its 3-translation spectrum 
does not, of course, imply that recognition of classes of figures is 
order 3. (It does imply that the translations of two different 
figures can be so separated. In fact, the method of §7.3, Applica- 
tion 7, shows this can be done with order 2, but only outside 
the bounded-coefficient restriction.) 
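The reconstruction argument in the proof of Theorem 6.3.3 can be followed step by step in code. The sketch below assumes figures are sets of at least three integer lattice points and represents each translation class of 3-point masks by a canonical triple; the names `canon`, `three_spectrum` and `reconstruct` are illustrative. It picks one longest difference vector, collects the classes that contain it, and reads off the remaining points.

```python
from collections import Counter
from itertools import combinations


def canon(points):
    """Translate points so that the lexicographically smallest one is the origin."""
    pts = sorted(points)
    x0, y0 = pts[0]
    return tuple((x - x0, y - y0) for x, y in pts)


def three_spectrum(figure):
    """Number of 3-point masks satisfied in each translation-equivalence class."""
    return Counter(canon(triple) for triple in combinations(sorted(figure), 3))


def reconstruct(spectrum):
    """Rebuild a figure (up to translation) from its 3-vector spectrum."""
    # Step 1: find one longest difference vector d occurring in any class.
    best = None
    for triple in spectrum:
        for u in triple:
            for v in triple:
                d = (v[0] - u[0], v[1] - u[1])
                key = (d[0] ** 2 + d[1] ** 2, d)
                if best is None or key > best:
                    best = key
    d = best[1]
    # Step 2: every class containing a pair (u, u + d) contributes its third
    # point, expressed relative to u, which plays the role of the point a.
    points = {(0, 0), d}
    for triple in spectrum:
        for u in triple:
            v = (u[0] + d[0], u[1] + d[1])
            if v in triple:
                (w,) = [p for p in triple if p != u and p != v]
                points.add((w[0] - u[0], w[1] - u[1]))
    return frozenset(canon(points))


X = {(0, 0), (1, 0), (2, 1), (0, 3), (4, 4), (3, 1)}
print(reconstruct(three_spectrum(X)) == frozenset(canon(X)))
```

Step 2 relies on the fact established in the proof: a longest vector occurs between exactly one pair of points, so the point `u` is the same in every class that contains it.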

6.4 Patterns of Order 4 and Higher 

We can use the fact that any three points determine a circle to 
make an order-4 perceptron for the predicate 

$$\lceil X \text{ is the perimeter of a complete circle} \rceil$$

by using the form 

$$\Bigl\lceil \sum_{d \notin C_{abc}} x_a x_b x_c x_d + \sum_{d \in C_{abc}} x_a x_b x_c \bar{x}_d < 1 \Bigr\rceil,$$

where $C_{abc}$ is the circle* through $x_a$, $x_b$, and $x_c$. Many other 
curious and interesting predicates can be shown by similar arguments 
to have small orders. One should be careful not to conclude 
that there are practical consequences of this, unless one is prepared 
to face the facts that 

1. Large numbers of $\varphi$'s may be required, of the order of 
for the examples given above. 

2. The threshold conditions are sharp, so that engineering considerations 
may cause difficulties in realizing the linear summation, 
especially if there is any problem of noise. Even with simple 
square-root noise, for $k = 3$ or larger the noise grows faster than 
the retinal size. The coefficient sizes are often fatally large, as 
shown in Chapter 10. 

3. A very slight change in the pattern definition† often destroys 
its order of recognizability. With low orders, it may not be possible 
to define tolerances for reasonable performance. 


6.5 Spectral Recognition Theorems 

A number of the preceding examples are special cases of the 
following theorems. (The ideas introduced here are not used 
later.) The group-invariance theorem (§2.3) shows that if a predicate 
$\psi$ is invariant with respect to a group $G$, then if $\psi \in L(\Phi)$ 
for some $\Phi$ it can be realized by a form 

$$\psi = \Bigl\lceil \sum_i \alpha_i N_i(X) > \theta \Bigr\rceil$$

where $N_i$ is the number of $\varphi$'s satisfied by $X$ in the $i$th equivalence 
class. In §6.2 we touched on the “difference vector spectrum” 
for geometric figures under the group of translations of the plane. 
Those spectra are in fact the numbers $N_i(X)$ for order 0, 1, and 2. 
If a $G$-invariant $\psi$ cannot be described by any condition on the 
$N_i$'s for a given $\Phi$, then obviously $\psi$ is not in $L(\Phi)$. The following 
results show some conditions on the $N_i$ that imply that $\psi$ is of 
finite order. 

* Again there is a tolerance problem: what is a circle in the discrete retina? 
See §8.3. 

† Our formula accepts (appropriately) zero- and one-dimensional “circles.” This 
phenomenon cannot be avoided, in any dimension, by a conjunctively local 
predicate. 

Suppose that $\psi$ is defined by $m$ simultaneous equalities: 

$$\psi(X) \equiv \lceil N_1(X) = n_1 \text{ and } N_2(X) = n_2 \text{ and } \dots \text{ and } N_m(X) = n_m \rceil,$$

where $n_1, \dots, n_m$ is a finite sequence of integers. The order of $\psi$ 
is not more than twice the maximum of the orders of the $\varphi$'s associated 
with the $N_i$'s. We will state this more precisely as 

Theorem 6.5: Let 

$$\Phi = \Phi_1 \cup \Phi_2 \cup \dots \cup \Phi_m$$

and 

$$N_i(X) = |\{\varphi \mid \varphi \in \Phi_i \text{ AND } \varphi(X) = 1\}| = \sum_{\varphi \in \Phi_i} \varphi(X).$$

Then the order of 

$$\psi(X) = \lceil N_i(X) = n_i, \text{ for } 1 \le i \le m \rceil$$

is at most twice 

$$\max \{|S(\varphi)| : \varphi \in \Phi\}.$$

The goal of the proof is to show that the definition of $\psi$ can be 
put in the form of a linear threshold expression, namely, 

$$\psi(X) = \Bigl\lceil \sum_i (N_i(X) - n_i)^2 < 1 \Bigr\rceil.$$

As it stands this is not a linear threshold combination of predicates. 
To recast it into the desired shape we introduce an ad hoc 
convention that will not be used elsewhere. Given any set $\Phi$ of 
predicates we construct a new set of predicates $\Phi^2$ by listing 
all pairs $(\varphi_i, \varphi_j)$ of predicates in $\Phi$ and defining 

$$\varphi_{ij}(X) = \varphi_i(X) \wedge \varphi_j(X).$$

Many of the predicates so constructed will be logically equivalent, 
for example, $\varphi_{ij} = \varphi_{ji}$, but we make the convention that these 
are to be counted as distinct members of $\Phi^2$. (This means that in 
a very strict sense $\Phi^2$ is a set of “predicate forms” rather than of 
predicates.) 

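The effect of the convention is to simplify the arithmetic and logic 
of the counting argument we are about to use. Let $X$ be a figure 
for which exactly $N$ predicates in $\Phi$ are satisfied. Obviously $N^2$ 
predicates of $\Phi^2$ will be satisfied by $X$, that is, 

$$\sum_{\varphi \in \Phi^2} \varphi(X) = N^2.$$

Now let $\Phi_1, \Phi_2, \dots,$ be an enumeration of the equivalence classes 
of $\Phi$. The number of predicates of $\Phi_i$ satisfied by $X$ is 

$$N_i(X) = \sum_{\varphi \in \Phi_i} \varphi(X),$$

so, as we have seen, 

$$\sum_{\varphi \in \Phi_i^2} \varphi(X) = N_i^2(X).$$

Thus 

$$\sum_i \Bigl( \sum_{\varphi \in \Phi_i^2} \varphi(X) - 2 n_i \sum_{\varphi \in \Phi_i} \varphi(X) + n_i^2 \Bigr) = \sum_i (N_i(X) - n_i)^2.$$

To represent the left-hand side of this equation in the standard 
linear threshold predicate form we define 
$\Phi' = \Phi^2 \cup \Phi \cup \{\text{the constant predicate}\}$, and write 

$$\psi(X) = \Bigl\lceil \sum_{\varphi \in \Phi'} \alpha(\varphi) \varphi(X) < 1 \Bigr\rceil$$

where 

$$\alpha(\varphi) = 1 \quad \text{for } \varphi \in \Phi_i^2,$$
$$\alpha(\varphi) = -2 n_i \quad \text{for } \varphi \in \Phi_i,$$
$$\alpha(\text{constant}) = \sum_i n_i^2.$$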



To complete the proof of the theorem we have only to observe 
that 

$$|S(\varphi_{ij})| = |S(\varphi_i) \cup S(\varphi_j)| \le |S(\varphi_i)| + |S(\varphi_j)| \le 2 (\max |S(\varphi)|). \quad \text{Q.E.D.}$$
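The counting argument can be checked on a small retina. In this sketch the predicates are assumed to be masks, each represented by the frozenset of points in its support, so that the conjunction of two masks is the mask of the union of their supports. The two classes used below (single points and adjacent pairs on a four-point row) are a hypothetical example.

```python
from itertools import combinations, product


def phi(mask, figure):
    """A mask is satisfied when every point of its support lies in the figure."""
    return int(mask <= figure)


def exact_match_direct(classes, targets, figure):
    """psi(X) = [N_i(X) = n_i for every i], evaluated by counting."""
    return all(
        sum(phi(m, figure) for m in cls) == n for cls, n in zip(classes, targets)
    )


def exact_match_linear(classes, targets, figure):
    """The same predicate as a threshold on a linear form over Phi'."""
    total = 0
    for cls, n in zip(classes, targets):
        # Coefficient 1 on every ordered pair of masks from the same class.
        total += sum(phi(a | b, figure) for a, b in product(cls, repeat=2))
        # Coefficient -2 n_i on the masks themselves.
        total += -2 * n * sum(phi(m, figure) for m in cls)
        # The constant term contributes n_i squared.
        total += n * n
    return total < 1


retina = range(4)
points = [frozenset({i}) for i in retina]          # order-1 masks
adjacent = [frozenset({i, i + 1}) for i in range(3)]  # order-2 masks
classes = [points, adjacent]

figure = frozenset({0, 1, 3})
print(exact_match_direct(classes, (3, 1), figure),
      exact_match_linear(classes, (3, 1), figure))
```

The largest support appearing in the linear form is the union of two adjacent-pair masks, at most four points, which matches the bound of twice the maximum support.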

6.5.1 Extended Exact Matching 

An obvious generalization of Theorem 6.5 is this: Suppose that 
$\psi$ is defined by 

$$\psi(X) \equiv \bigvee_{i=1}^{n} \bigwedge_{j=1}^{m} \bigl( N_j(X) = n_{ij} \bigr),$$

that is, $\psi$ satisfies any one of a number of exact conditions on the 
$N_j$. Then $\psi$ is of finite order, for we can realize the polynomial 
form 

$$\prod_{i=1}^{n} \sum_{j=1}^{m} (N_j - n_{ij})^2$$

by methods like those in the previous paragraph. However, the 
extension now requires Boolean products of predicates of different 
equivalence classes, and the maximal order required will be 
$\le 2n \cdot \max |S(\varphi)|$. 

Note that if one were not aware of the And/Or phenomenon, 
one might be tempted to try to obtain §6.5.1 from §6.5 via the 
false conjecture 

$$\bigvee^{n} (\text{predicates of order } k) \text{ is order} \le nk.$$

6.5.2 Mean-Square Variation 

In the expressions for the predicates discussed in §6.5.1, we could 
increase $\theta$ to higher values: 

$$\Bigl\lceil \sum_i (N_i - n_i)^2 < \theta \Bigr\rceil.$$

Then the system will accept exactly those figures for which the 
sum of the squares of the differences of the $N_i$'s and the $n_i$'s is 
bounded by $\theta$. Any pattern-classification machine will be sensitive 
to certain kinds of distortion, and this observation hints that it 
might be useful to study such machines, and perceptrons in particular, 
in terms of their spectrum-distortion characteristics. 
Unfortunately, we don’t have any good ideas concerning the 
geometric meaning of such distortions. The geometric nature of 
this sort of “invariant noise” is an interesting subject for speculation, 
but we have not investigated it. 

6.6 Figures in Context 

For practical and theoretical reasons it is interesting to study the 
recognition of figures “in context”, for example, 

$$\psi(X) = \lceil \text{a subset of } X \text{ is a square} \rceil,$$

$$\psi(X) = \lceil \text{a connected component of } X \text{ is a square} \rceil,$$

or, to begin to consider three-dimensional projection problems, 

$$\psi(X) = \lceil X \text{ contains a significant portion of the outline of a partially obscured square} \rceil.$$

The examples show that there is more than one natural meaning 
one could give to the intuitive project of recognizing instances of 
patterns embedded in contexts. We do not know any general 
definition that might cover all natural senses, and are therefore 
unable to state general theorems. We do, nevertheless, claim that 
the general rule is for low-order predicates to lose their property 
of finite order when embedded in context in any natural way. To 
illustrate the thesis we shall pick a particularly common and 
apparently harmless interpretation: For any predicate $\psi(X)$ define 
a new one by 

$$\psi_{\text{IN CONTEXT}}(X) = \lceil \psi(Y) \text{ for some connected component } Y \text{ of } X \rceil.$$


[112] 6.6 Geometric Theory of Linear Inequalities 


It will be obvious that the techniques we use can be adapted 
trivially to many other definitions. 
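The definition of a predicate in context can be made concrete with a
short sketch. The code below is only an illustration under stated
assumptions: a figure is a set of (row, column) squares, components are
taken with 4-adjacency (the text does not fix the adjacency rule), and
the names `connected_components`, `in_context`, and
`make_full_row_predicate` are hypothetical helpers, not notation from
the book.

```python
from typing import Callable, FrozenSet, Iterable, List, Set, Tuple

Point = Tuple[int, int]          # (row, column) of a black square
Figure = FrozenSet[Point]


def connected_components(figure: Iterable[Point]) -> List[Figure]:
    """Split a figure into 4-connected components (adjacency is an assumption)."""
    remaining: Set[Point] = set(figure)
    components: List[Figure] = []
    while remaining:
        stack = [remaining.pop()]
        component = set(stack)
        while stack:
            r, c = stack.pop()
            for neighbour in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if neighbour in remaining:
                    remaining.remove(neighbour)
                    component.add(neighbour)
                    stack.append(neighbour)
        components.append(frozenset(component))
    return components


def in_context(psi: Callable[[Figure], bool]) -> Callable[[Iterable[Point]], bool]:
    """Return psi_IN_CONTEXT: true when psi holds for some component of X."""
    def psi_in_context(figure: Iterable[Point]) -> bool:
        return any(psi(component) for component in connected_components(figure))
    return psi_in_context


def make_full_row_predicate(width: int) -> Callable[[Figure], bool]:
    """psi(X) = [X is exactly one full row of a retina with `width` columns]."""
    def psi(figure: Figure) -> bool:
        rows = {r for r, _ in figure}
        return len(rows) == 1 and len(figure) == width
    return psi
```

For a retina four squares wide, a full row plus one isolated square
fails `psi` itself but satisfies `in_context(psi)`, which is the
situation treated in Theorem 6.6.1.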

Intuitively, we would expect $\psi_{\text{IN CONTEXT}}$ to be much harder for a
perceptron since the context of each component acts as noise and
the parallel operation of the device allows little chance for this to
be separated and ignored. The point appears particularly clearly
in the cases where $\psi$ uses rejection rules. These cannot be
transferred over to $\psi_{\text{IN CONTEXT}}$ for very obvious reasons. Similarly, we
will lose the stratification methods of Chapter 7 and, indeed, most
of our technical tricks used to obtain low-order representations of
predicates. The next two theorems show how this intuitive idea
can be given a rigorous form. It should, however, be observed
that no simple generalization is possible about the relation of $\psi$
to $\psi_{\text{IN CONTEXT}}$ since some $\psi$'s become degenerate in context. For
example, $\psi_{\text{CONNECTED}}$ becomes degenerate in context because every
set has a connected component!

Theorem 6.6.1: Let $R$ be a finite square retina and let $\psi(X)$ be

$\psi(X) = \lceil X$ is exactly one horizontal line across the retina$\rceil$.

Then $\psi$ is of order 2 but $\psi_{\text{IN CONTEXT}}$ is not of finite order.

proof: We leave as an exercise the proof that $\psi$ as defined has
order 2. To show that $\psi_{\text{IN CONTEXT}}$ is not of finite order we
merely observe that it is the negation of the negative of the
one-in-a-box predicate, $\psi_1$, namely the predicate that asserts
there is no horizontal white line across the retina. Its negative
(in the photographic sense) asserts that there is no horizontal
black line. Now $\psi_1$ is not of finite order, and one can show in
general that the same is true of any such predicate's negative.
Finally, by reversing the predicate’s inequality we find the same
is true for the desired

$\psi_{\text{IN CONTEXT}} = \lceil X$ contains a horizontal line across the retina$\rceil$.

Theorem 6.6.2: Let $\psi(X)$ be

$\lceil X$ is a hollow square$\rceil$.

Then $\psi_{\text{IN CONTEXT}}$ is not of finite order:

$\psi_{\text{IN CONTEXT}}(X) = \lceil$One component of $X$ is a hollow square$\rceil$.






proof: The proof is exactly the same as the previous except that
the “boxes” or horizontal lines are folded into squares and
mapped without overlap into a larger retina. Again, it can be
shown that $\psi$ itself is of finite order; in this case, order 3.

Note: An alternative proof method is to fold the lines of switch- 
ing elements used in the Huffman construction for connectivity 
(§5.5). 

It is our conviction that the deterioration of the perceptron’s
ability to recognize patterns embedded in other contexts is a
serious deterrent to using it in real, practical situations. Of course
this deficiency can be mitigated by embedding the perceptron in a
more serial process, one in which the figure of interest is isolated
and separated from its context in an earlier phase. But this
presupposes enough recognition ability, in the “preprocessing”
phase, to discern and remove the most commonly encountered
contextual disturbances, and this may be much harder than the
“processing” phase. We treat this further in Chapter 13.




Stratification and Normalization 

7 


7.1 Equivalence of Figures 

In previous chapters we discussed the recognition of patterns — 
classes of figures closed under the transformations of some group. 
We now turn to the related question of recognizing the equiva- 
lence, under a group, of an arbitrary pair of figures. The results 
below were surprising to us, for we had supposed that such prob- 
lems were not generally of finite order. A number of questions 
remain open, and the superficially positive character of the fol- 
lowing constructions is clouded by the apparently enormous co- 
efficients they require, and the manner in which they increase with 
the size of the retina. 


A typical problem has this form: The retina* 




is presented as two equal parts A and B and we ask: is the figure 
in part B a rigid translation of the figure in part A? More gen- 
erally, is there an element g from some given group G of trans- 
formations for which the figure in B is the result of g operating on 
the figure in A? What order predicates are required to make such 
distinctions? The results of this chapter all derive from use of a 
technique we call stratification. Stratification makes it possible, 
under certain conditions, to simulate a sequential process by a 
parallel process, in which the results are so weighted that, if 
certain conditions are satisfied, some computations will numer- 
ically outweigh effects of others. The technique derives from the 
following theorem: 


* All the theorems of this chapter apply directly to perceptrons on infinite 
retinas; that is, without having to consider limiting processes on sequences of 
finite retinas as proposed in §1.6. The transformation groups, too, are infinite, 
and the group-invariance theorem is not used. Because this material is somewhat 

more specialized than the rest, we will regress a little toward the conventional 
and hideous style of mathematical exposition, in which theorems are stated 
and proved before explaining what they are for. 





[Handwritten addition by the authors, only partly legible in this
copy: This proved (word illegible) for many readers, which is a pity
because the basic idea is so simple. We have therefore added, at the
end of the chapter, a numerical example to illustrate the technique.
The remainder of the note is illegible.]

7.2 The Stratification Theorem 

Let $\Pi = \{\pi_1, \pi_2, \ldots, \pi_j, \ldots\}$ be a sequence of different masks and
define a sequence $C_1, \ldots, C_j, \ldots$ of classes by

$X \in C_j \iff [\pi_j(X) \text{ and } (k > j \Rightarrow \neg\pi_k(X))]$.

Thus $X$ is in $C_j$ if $j$ is the highest index for which $\pi_j(X)$ is
true, as is illustrated below.



Figure 7.1 The partition into the $C_j$'s.


Let $\Phi = \{\varphi_i\}$ be a family of predicates and let $\psi_1, \ldots, \psi_j, \ldots$
be an ordered sequence of predicates in $L(\Phi)$ that are each
bounded in the sense that for each $\psi_j$ there is a linear form
$\Sigma_j$ with integer coefficients such that

$\Sigma_j = \sum_i \alpha_{ij} \varphi_i - \theta_j$ AND $\psi_j = \lceil \Sigma_j > 0 \rceil$

and a bound $B_j$ such that

$|\Sigma_j(X)| < B_j$

for all finite $|X|$. (The proof actually requires only that each
$|\Sigma_j(X)|$ be bounded on each $C_k$.)

Theorem 7.2: The predicate $\psi(X) = \lceil X \in C_j \Rightarrow \psi_j(X) \rceil$ obtained
by taking on each $C_j$ the values of the corresponding $\psi_j$, lies in
$L(\Phi \cdot \Pi)$; that is, it can be written as a form

$\psi(X) = \lceil \sum \alpha_{jk} (\pi_j \wedge \varphi_k) > \theta \rceil$.



proof: It is easy to see that every finite $X$ will lie in one of the $C_j$.
Define

$S_1 = \pi_1 \cdot \Sigma_1$,

and for $j > 1$ define inductively

$M_j = \max |S_{j-1}|$,

$S_j = S_{j-1} - \pi_j M_j + (2 M_j + 1) \cdot \pi_j \cdot \Sigma_j$.

The bounds $B_j$ assure the existence of the $M_j$'s. Now write the
formal sum generated by this infinite process as

$S = \sum \alpha_{jk} (\pi_j \wedge \varphi_k)$,

and we will show that $\psi(X) = \lceil S(X) > 0 \rceil$. The infinite sum is
well-defined because for any finite $X$ in any $C_j$ there will be only
a finite number of nonzero $\pi_j \wedge \varphi_k$ terms. Base: It is clear that if
$X$ is in $C_1$ then $S_1 = \Sigma_1$, so $\psi(X) = \lceil S_1(X) > 0 \rceil$. Induction:
Assume that if $X$ is in $C_{j-1}$ then $\psi(X) = \lceil S_{j-1}(X) > 0 \rceil$. Now the
coefficients are integers, so if $X \in C_j$, $\pi_j = 1$ and

if $\psi(X)$ then $\Sigma_j \ge 1$, so $S_j \ge [-M_j - M_j + 2 M_j + 1] = 1$,

if $\neg\psi(X)$ then $\Sigma_j \le 0$, so $S_j \le [M_j - M_j] = 0$. Q.E.D.

Corollary 7.2: The order of $\psi(X)$ is no larger than the sum of the
maximum |support| in $\Phi$ and the maximum |support| in $\Pi$.
This follows because the predicates in $\Phi$ occur only as conjuncts
with predicates in $\Pi$.
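The recurrence in the proof can be run by brute force on a very small
retina. The sketch below is an illustration, not the authors'
construction verbatim. It assumes a half-line retina of six cells
indexed from 0 (the book indexes from 1), takes the masks to be
$\pi_j = x_j$ so that $C_j$ is the class of figures whose rightmost cell
is $j$, and uses a hypothetical three-cell pattern `(0, 1, 3)`. The
names `stratify`, `translate_form`, and `is_translate` are illustrative.

```python
from itertools import combinations
from typing import Callable, FrozenSet, List, Sequence, Tuple

Figure = FrozenSet[int]


def all_figures(n: int) -> List[Figure]:
    """Every subset of the retina {0, ..., n-1} (feasible only for small n)."""
    cells = range(n)
    return [frozenset(c) for size in range(n + 1) for c in combinations(cells, size)]


def stratify(masks: Sequence[Callable[[Figure], bool]],
             forms: Sequence[Callable[[Figure], int]],
             figures: Sequence[Figure]) -> Tuple[List[int], Callable[[Figure], int]]:
    """Build S = sum_j pi_j * ((2*M_j + 1) * Sigma_j - M_j) from the recurrence.

    masks[j] plays the role of pi_j and forms[j] of the bounded form Sigma_j.
    Returns the list of M_j and a function that evaluates the final sum S(X).
    """
    big: List[int] = []

    def partial_sum(figure: Figure, upto: int) -> int:
        return sum(
            (2 * big[j] + 1) * forms[j](figure) - big[j]
            for j in range(upto)
            if masks[j](figure)
        )

    for j in range(len(masks)):
        # M = 0 for the first stratum reproduces S_1 = pi_1 * Sigma_1;
        # afterwards M_j is the largest |S_{j-1}(X)| over all figures.
        m = 0 if j == 0 else max(abs(partial_sum(x, j)) for x in figures)
        big.append(m)

    return big, lambda figure: partial_sum(figure, len(masks))


def translate_form(pattern: Sequence[int], j: int) -> Callable[[Figure], int]:
    """Sigma_j testing 'X is the translate of pattern whose rightmost cell is j'."""
    shift = j - max(pattern)
    target = frozenset(p + shift for p in pattern)
    if min(target) < 0:
        return lambda figure: -1          # the pattern does not fit: always reject

    def sigma(figure: Figure) -> int:
        inside = len(figure & target)
        outside = sum(1 for k in figure if k < j and k not in target)
        return inside - outside - (len(pattern) - 1)
    return sigma


N = 6
PATTERN = (0, 1, 3)                        # hypothetical example pattern
FIGURES = all_figures(N)
MASKS = [lambda figure, j=j: j in figure for j in range(N)]
FORMS = [translate_form(PATTERN, j) for j in range(N)]
M, S = stratify(MASKS, FORMS, FIGURES)


def is_translate(figure: Figure) -> bool:
    """Direct (serial) test used only to check the stratified sum."""
    return any(figure == frozenset(p + s for p in PATTERN) for s in range(N))
```

On this retina the sign of `S(X)` agrees with `is_translate(X)` for all
64 figures. The list `M` also shows how quickly the coefficients grow
with the stratum index, which is the point made in the discussion that
follows.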

The idea is that the domain of $\psi(X)$ is divided into the disjoint
classes or “strata,” $C_j$. Within each stratum the $-\pi_j M_j$ term is
so large that it negates all decisions made on lower strata, unless
the $\psi_j$ test is passed. In all the applications below, the strata
represent, more or less, the different possible deviations of a figure
from a “normal” position. Hence there is a close connection
between the possibility of constructing “stratified” predicates, and
the conventional “pattern recognition” concept of identifying a
figure first by normalizing it and then by comparing the normalized
image with a prototype. This, of course, is usually a serial
process.

It should be noted that predicates obtained by the use of this 
theorem will have enormous coefficients, growing exponentially 
or faster with the stratification index j. Thus the results of this 
chapter should not be considered of practical interest. They are 
more of theoretical interest in showing something about the rela- 
tion of the structure of the transformation groups to the orders 
of certain predicates invariant under those groups. 

7.3 Application 1 : Symmetry along a Line 

Let $R = \ldots, x_s, \ldots$ be the points of an infinite linear retina, that
is, $-\infty < s < \infty$; it is convenient to choose an arbitrary origin
$x_0$ and number the squares as shown:

[Figure: the squares of the linear retina, numbered $\ldots, x_{-1}, x_0, x_1, \ldots$]

Suppose that $X$ is a figure in $R$ with finite $|X|$. We ask whether

$\psi_{\text{SYMMETRICAL}}(X) = \lceil X$ has a symmetry under reflection$\rceil$

is of finite order.

It should be observed that the predicate would be trivially of
order 2 if the center of symmetry were fixed in advance. But
$\psi_{\text{SYMMETRICAL}}$ allows it to be anywhere along the infinite line.

We will “stratify” $\psi_{\text{SYMMETRICAL}}$ by finding sequences $\pi_1, \ldots$ and
$\psi_1, \ldots$ that allow us to test for symmetry, using the following
trick: the $\pi_j$'s will “find” the two “end points” of $X$ and the
corresponding $\psi_j$'s will test the symmetry of a figure, assuming
that it has exactly these end points. Our goal, then, is to define the
$\pi_j$'s so that each $C_j$ will be the class of figures with a certain pair
of end points. To do this we need $\pi_1, \ldots$ to be an enumeration
of all segments $(x_s, x_{s+d})$ for every $s$ and for every $d \ge 0$, with
the property that any term $(x_s, x_{s+d})$ must follow any term
$(x_{s+a}, x_{s+b})$ with $0 \le a \le b \le d$. There do indeed exist such
sequences, for example:




$\pi_1 = x_0 x_0$
$\pi_2 = x_1 x_1$
$\pi_3 = x_0 x_1$
$\pi_4 = x_{-1} x_{-1}$
$\pi_5 = x_{-1} x_0$
$\pi_6 = x_{-1} x_1$
$\pi_7 = x_2 x_2$
$\pi_8 = x_1 x_2$
. . .

[Figure: the successive segments of the enumeration drawn on the
retina, continuing in the same pattern.]


It can be seen that (1) each segment occurs eventually, and (2) no
segment is ever followed by another that lies within it. Therefore,
if $x_s$, $x_{s+d}$ are the extreme left and right points of $X$, then $X$ will
lie in precisely the $C_j$ for that $(x_s, x_{s+d})$. Now define $\psi_j$ to be

$\psi_j = \lceil x_{s+i} = x_{s+d-i},\ i = 0, \ldots, d \rceil$

or, equivalently,

$\psi_j = \lceil \sum_{i=0}^{d} (x_{s+i} - x_{s+d-i})^2 < 1 \rceil$,

showing that it is a predicate of order 2 bounded by $B_j = d + 1$.


So, finally, application of the stratification theorem shows that
$\psi_{\text{SYMMETRICAL}}$ has order $\le 4$, since the $\psi_j$'s have order $\le 2$ and the
$\pi_j$'s have support $\le 2$.
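The enumeration of segments and the per-stratum symmetry test can be
sketched as follows. This is an illustration under assumptions: the
generator `segments` reproduces the order of the masks listed above by
growing the range of points alternately to the right and to the left,
and `stratum`, `mismatch`, and `is_symmetric` are hypothetical names.
The code finds the stratum serially instead of through large
coefficients, so it shows what each $\psi_j$ computes rather than how a
perceptron would combine them.

```python
from itertools import count, islice
from typing import Iterator, Optional, Set, Tuple

Segment = Tuple[int, int]


def segments() -> Iterator[Segment]:
    """Yield segments (s, e), s <= e, so that no segment precedes one inside it."""
    lo = hi = 0
    yield (0, 0)
    for step in count(1):
        hi = step                        # extend the known range to the right
        for s in range(hi, lo - 1, -1):
            yield (s, hi)
        lo = -step                       # then extend it to the left
        for e in range(lo, hi + 1):
            yield (lo, e)


def stratum(figure: Set[int], limit: int = 500) -> Optional[Tuple[int, Segment]]:
    """Index j (from 1) and segment of the last mask x_s * x_e true on the figure.

    Only the first `limit` masks are examined, so the figure must lie in the
    part of the retina that those masks cover.
    """
    best = None
    for j, (s, e) in enumerate(islice(segments(), limit), start=1):
        if s in figure and e in figure:
            best = (j, (s, e))
    return best


def mismatch(figure: Set[int], s: int, e: int) -> int:
    """Sum over i of (x_{s+i} - x_{e-i})^2 for the 0/1 indicator of the figure."""
    return sum(((s + i in figure) - (e - i in figure)) ** 2 for i in range(e - s + 1))


def is_symmetric(figure: Set[int]) -> bool:
    """Apply the symmetry test of the stratum that a nonempty figure falls in."""
    found = stratum(figure)
    if found is None:
        return False
    _, (s, e) = found
    return mismatch(figure, s, e) < 1
```

For example, the figure with black squares at -1, 0 and 2 falls in the
stratum of the segment $(x_{-1}, x_2)$ and fails the test, while the
figure at 3, 5 and 7 passes it.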


7.4 Application 2: Translation-Congruence along a Line 

Let $\ldots, x_s, \ldots$ and $\ldots, y_t, \ldots$ be the points of two infinite linear
retinas, that is, $-\infty < s < \infty$ and $-\infty < t < \infty$:



Let $X$ be a figure composed of a part $X_A$ in the left retina and a
part $X_B$ in the right retina. We want to construct

$\psi_{\text{TRANSLATE}}(X) = \lceil$the (finite) pattern in A is a translate of the
pattern in B$\rceil$.

To “stratify” $\psi_{\text{TRANSLATE}}$ we have to find a sequence $\pi_j$ that allows
us to test, with appropriate $\psi_j$'s, whether the A and B parts of $X$
are congruent. We will do this by a method like that used in
§7.3, but we now have to handle two segments simultaneously.
That is, we need a sequence of $\pi_j$'s that enumerates all quadruples
in such a way that a figure lies in $C_j$ if and only if the end points
of its A and B parts are precisely the corresponding values of
$x_s$, $x_{s+d_x}$, $y_t$, and $y_{t+d_y}$. There does indeed exist such a sequence
(!), and one can be obtained from the $\pi$'s of §7.3 as follows (the
reader might first try to find one himself).

Define $\pi_{jk}$ to be the four-point mask obtained by

$\pi_{jk}(X) = \pi_j(X_A) \cdot \pi_k(X_B)$,

that is, by choosing according to $j$ two points of A and according
to $k$ two points of B. The master sequence requires us to enumerate
all $\pi_{jk}$'s under the condition that no $\pi_{ab}$ can precede any
$\pi_{cd}$ if both $a \ge c$ and $b \ge d$.

A solution is

$\pi_{11}, \pi_{21}, \pi_{12}, \pi_{22}, \pi_{31}, \pi_{32}, \pi_{13}, \pi_{23}, \pi_{33}, \pi_{41}, \pi_{42}, \pi_{43}, \pi_{14}, \pi_{24}, \ldots,$

and for the $\pi_{jk}$ term in this sequence, an appropriate predicate
$\psi_{(jk)}$ is

$\psi_{(jk)} = \lceil$the segments defined by $\pi_j$ and $\pi_k$ have the same lengths,
and the $x$'s and $y$'s in those intervals have the same values
at corresponding points$\rceil$.

This is an order-2 predicate, and bounded (by the segment
lengths). The $\pi_{jk}$'s now have support 4, so $\psi_{\text{TRANSLATE}}(X)$ has finite
order $\le 6$. Actually, having found both extrema of $X_A$, it is
necessary only to find one end of $X_B$, so a slightly different construction
using the method of §7.9 would show that the order of $\psi_{\text{TRANSLATE}}$
is $\le 5$.
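The master sequence for pairs of indices can be generated and checked
mechanically. The sketch below is one way to produce the ordering
listed above; the generator name `pairs` and the check
`respects_dominance` are hypothetical. The condition tested is that
every $\pi_{cd}$ with $c \le a$ and $d \le b$ (other than $\pi_{ab}$
itself) appears before $\pi_{ab}$.

```python
from itertools import count, islice
from typing import Iterator, List, Tuple

Pair = Tuple[int, int]


def pairs() -> Iterator[Pair]:
    """Yield index pairs (j, k) shell by shell, where a shell has max(j, k) = n."""
    for n in count(1):
        for k in range(1, n):
            yield (n, k)
        for j in range(1, n):
            yield (j, n)
        yield (n, n)


def respects_dominance(sequence: List[Pair]) -> bool:
    """True if every pair comes after all pairs it dominates componentwise."""
    position = {pair: i for i, pair in enumerate(sequence)}
    for (a, b), i in position.items():
        for c in range(1, a + 1):
            for d in range(1, b + 1):
                if (c, d) != (a, b) and position.get((c, d), len(sequence)) > i:
                    return False
    return True
```

With this ordering, if $\pi_a$ is the last mask true on $X_A$ and $\pi_b$
the last one true on $X_B$, then $\pi_{ab}$ is the last product mask
true on $X$, so the figure lands in the stratum that tests the right
pair of segments.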


7.5 Application 3: Translation on the Plane 

The method of application 2 can be applied to the problem of the 
two-dimensional translations of a bounded portion of the plane 
by using the following trick. Let each copy of the retina be an 
$(m \times m)$ array. Arrange the squares into a sequence $\{x_i\}$ with
the square at $(a, b)$ having index $ma + b$. In effect, we treat the
retina as a cylinder and index its squares so:



This maps each half of the retina onto a line like that of application
2 in such a way that, for limited translations that do not carry
the figure $X$ over the edge of the retina, translations on the plane
are equivalent to translations on the line, and an order-5 predicate
can be constructed. In §7.6 we will show how the ugly restriction
just imposed can be eliminated!
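A small sketch may help to see why the indexing $ma + b$ turns plane
translations into line translations, and where the restriction comes
from. The helper names below are hypothetical, and squares are written
as pairs $(a, b)$ with $0 \le a, b < m$.

```python
from typing import Set, Tuple

Square = Tuple[int, int]


def index(a: int, b: int, m: int) -> int:
    """Position on the line of the square at (a, b) in an m-by-m array."""
    return m * a + b


def flatten(figure: Set[Square], m: int) -> Set[int]:
    """Map a plane figure to the set of its line indices."""
    return {index(a, b, m) for a, b in figure}


def unflatten(indices: Set[int], m: int) -> Set[Square]:
    """Inverse of flatten for indices in range(m * m)."""
    return {divmod(k, m) for k in indices}


def translate_plane(figure: Set[Square], da: int, db: int) -> Set[Square]:
    """Translate a plane figure by the vector (da, db)."""
    return {(a + da, b + db) for a, b in figure}


def shift_line(indices: Set[int], k: int) -> Set[int]:
    """Translate a line figure by k positions."""
    return {i + k for i in indices}
```

As long as the translated figure stays on the array, translating by
$(da, db)$ on the plane is the same as shifting by $m \cdot da + db$ on
the line. A line shift that pushes a square past the end of a row wraps
it onto the next row, which is not a plane translation; this is the
edge restriction mentioned in the text.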

Application 4. 180° rotation about undetermined point on the plane.
With the same kind of restriction, this predicate can be
constructed (with order 4) from application 1 by the same route that
derived application 3 from application 2. Similarly, we can detect
reflections about undetermined vertical axes.

7.6 Repeated Stratification 

In the conditions of the stratification theorem, the only restriction
on the $\psi_j$'s is that they be suitably bounded. In certain
applications the $\psi_j$'s themselves can be obtained by stratification. This is
particularly easy to do when the support of $\psi_j$ is finite, for then
boundedness is immediate. To illustrate this repeated stratification
we will proceed to remove the finite restriction in application
3 of §7.5.

First enumerate all the points of each of two infinite plane retinas 
A and B according to the more or less arbitrary pattern: 



Figure 7.2 


to obtain two sequences $x_1, \ldots, x_s, \ldots$ and $y_1, \ldots, y_t, \ldots$.

Now we will invoke precisely the same enumeration as in §7.4, but
with the definition

$\pi_{jk}(X) = (x_j \in X_A \text{ and } y_k \in X_B) = x_j \cdot y_k$.

Then $C_{(jk)}$ is the class of pairs $(X_A, X_B)$ for which

$j = \max \{ s \mid x_s \in X_A \}$,
$k = \max \{ t \mid y_t \in X_B \}$.

We need only a (bounded) $\psi_{(jk)}$ that decides whether $X_A$ is a
translate of $X_B$ for figures in $C_{(jk)}$. But the figures in $C_{(jk)}$ all lie
within bounded portions of the planes, in fact within squares of
about $[\max(j, k)]^{1/2}$ on a side around the origins. Within such a
square (or better, within one of twice the size to avoid
“edge-effects”) we can use the result of application 3, §7.5, to obtain a
predicate $\psi_{(jk)}$ with exactly the desired property, and with finite
support! The resulting order is $\le 5 + 2 = 7$. We have another
construction for this predicate of order $\le 5$. The same argument
can be used to lift the restrictions in application 4 of §7.5.

7.7 Application 5: The Axis-parallel Squares in the Plane 

We digress a moment to apply the method of the last section to
show that the predicate

$\psi_{\square}(X) = \lceil X$ is a solid (hollow) axis-parallel square$\rceil$,

where the form may lie anywhere in the infinite plane, has order
$\le 3$.







(We consider this remarkable because informal arguments, to the 
effect that two sides must be compared in length while the interior 
is also tested, suggest orders of at least 4. The result was dis- 
covered, and proven by another method, by our student, John 
White.) 

We enumerate the points $x_1, \ldots$ of a single plane, just as in §7.6,
and simply set $\pi_j = x_j$. Then $C_j$ is the set of figures whose
“largest” point is $x_j$. If $X$ is a square, the situation is like one of
the cases shown in Figure 7.4. We then construct $\psi_j$ by stratifying
as follows.

Figure 7.4

Let $x^j_1, x^j_2, \ldots, x^j_{n_j}$ be the finite sequence obtained by
stepping into the spiral figure orthogonally from $x_j$. Define
$\pi^j_i = x^j_i$ so that $C^j_i$ will contain all the squares of length $i$ on a
side that are “stopped” by $x_j$. But there is only one such square, call it
$S^j_i$. So to complete the double stratification we need only provide
predicates $\psi^j_i$ to recognize the squares $S^j_i$. But this can be done by

$\psi^j_i = \lceil \sum \alpha_k x_k \ge i^2 \rceil$

where

$\alpha_k = 1$ if $x_k \in S^j_i$,
$\alpha_k = -1$ if $x_k \notin S^j_i$ and $k < j$,
$\alpha_k = 0$ otherwise.

Then $\psi^j_i$ is of order 1. So $\psi_{\square}$ has order $\le 3$! Q.E.D.

7.8 Application 6: Figures Equivalent under Translation 
and Dilation 

Can a system of finite order recognize equivalence of two arbi- 
trary figures under translation and size change? 




Some reflection about the result and methods used in §7.6 and 
§7.7 will suggest that we have all the ingredients, for §7.6 shows 
how to handle translation, and §7.7 shows how to recognize all 
the translations and dilations of a particular figure. Now dilation 
involves serious complications with tolerance and resolution 
limits, in so far as our theory is still based on a fixed, discrete 
retina, and we do not want to face this problem squarely. None 
the less, it is interesting that the desired property can at least 
be approximated with finite order, in an intuitively suggestive 
fashion. (We do not think that a similar approximation can be 
made in the case of rotation invariance, because the problem 
there is of a different kind, one that cannot be blamed on the 
discrete retina. Rather, it is because the transformations of a 
rotation group cannot be simply ordered, and this “blocks” 
stratification methods.) 

Our method begins with the technique used in §7.6 to find
predicates $\pi_{(jk)}$ that “catch” the two figures in boxes. Then, just as
in §7.6, the problem is reduced to finding predicates that need
only operate within the boxes of Figure 7.2. We construct the
$\psi_{(jk)}$'s by a brutal method: within each box we use the simple
enumeration of points described in §7.5. Then we stratify four
times (!) in succession with respect to

$x$ = highest and leftmost point of A,
$y$ = highest and leftmost point of B,
$x'$ = lowest and rightmost point of A,
$y'$ = lowest and rightmost point of B.

We will need to define predicates $\psi^{(jk)}_{xx'yy'}$ for this. If the two
vectors $x - x'$ and $y - y'$ do not have the same direction we set
$\psi = 0$; otherwise we need a $\psi$ to test whether or not for every
vector displacement $v$

$y + v = x + \frac{|x - x'|}{|y - y'|} \cdot v$,

and this is an order-2 predicate, leading finally to total order
$\le 2 + 4 + 2 = 8$. Of course, on the discrete retina the indicated
operations on vectors will be ill-defined, but it seems clear that
the result is not vacuous: for example, we could ask for recognition
of the case where $X_B$ is a translate and an integer multiple
of $X_A$ in size, with each black square of $X_A$ mapping into a
correspondingly larger square in $X_B$. We have another
construction for this predicate of order $\le 6$.

7.9 Application 7: Equivalents of a Particular Figure 

In constructing $\psi$ for application 5, we noted that one can always
construct an order-1 predicate to detect precisely one particular
figure $X_0$ by using

$\lceil \sum_{x \in X_0} (1 - x) + \sum_{x \notin X_0} x < 1 \rceil$.

It follows that if we can construct a stratification $\{\pi_j\}$ for a
group $G$ such that

$X \in C_j$ AND $gX \in C_j \Rightarrow (gX = X)$,

then we can recognize exactly the $G$-equivalents of a given figure
$X_0$ (with one order higher than the order used by the stratification
$\pi$'s). This is suggestive of a machine that brings figures into a
normal form in the first stage of a recognition process. For this
case our general construction method takes a very simple form:
Consider a particular figure $X_0$ consisting of the ordered sequence
of points $\{x_{i_1}, \ldots, x_{i_p}\}$ on the half-line.

Let $\pi_j(X) = \lceil x_j \in X \rceil$ and define $\psi_j(X)$ as

$\lceil \sum [x_{k-j+i_p} \in X_0]\,(1 - x_k) + \sum [x_{k-j+i_p} \notin X_0 \text{ and } k < j]\, x_k < 1 \rceil$

ignoring for the moment points with negative indices. Then,
except for “edge effects” we obtain a predicate of order 2 that
recognizes precisely the translates of $X_0$. Next we observe that
there is really no difficulty in extending this to the two-way infinite
line, for we can enumerate the $\pi_j$'s in the order

$\pi_1 = x_0,\ \pi_2 = x_{-1},\ \pi_3 = x_1,\ \pi_4 = x_{-2},\ \pi_5 = x_2,\ \pi_6 = x_{-3},\ \pi_7 = x_3,\ \ldots$

so that if a figure ends up in class $C_{2j}$ we will have found its
leftmost point $x_{-j}$, and if it is in a $C_{2j+1}$ we will have found its
rightmost point $x_j$. In either case we can construct an appropriate
$\psi$. Hence, finally, we see that there exists for any given figure
$X_0$ a predicate of order 2 that recognizes precisely the linear
translations of $X_0$, and there is no problem about boundedness
because all $\psi$ supports are finite.








7.10 Apparent Paradox 

Consider the case of $X_0$ = [figure not reproduced].

We have just shown that there exists a $\psi$ of order 2 that accepts
just the translates of this figure. Hence $\psi$ must reject the
nonequivalent figure [figure not reproduced]. But both of these figures
have exactly the same $n$-tuple distribution spectrum (see §6.2 and
§6.5) up to order 2! Each has 3 points, and each has 1 adjacent pair,
1 pair two units apart, and 1 pair 3 units apart. Therefore, if all
group-equivalent $\varphi$'s had the same weights, a perceptron of order $\ge 3$
would be needed to distinguish them. Thus if we could apply the
group-invariance theorem we would in fact obtain a proof that no
perceptron of order 2 can distinguish between these. This would
be a contradiction! What is wrong? The answer is that the
group-invariance theorem does not in general apply to predicates
invariant under infinite groups. When a group is finite, for example,
the cyclic translation group of the toroidal spaces we have
considered from time to time, one can always use the group-invariance
theorem to make equal the coefficients of equivalent $\varphi$'s. But we
cannot use it together with stratification to construct the
predicate on infinite groups.
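The coincidence of the two spectra is easy to check numerically. The
two figures are not reproduced in this copy, so the sketch below
assumes a pair that matches the counts stated in the text: black
squares at positions 0, 1, 3 and at positions 0, 2, 3. Both the
figures and the helper names `spectrum` and `is_translate` are
assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, Tuple


def spectrum(figure: Iterable[int]) -> Tuple[int, Counter]:
    """Number of points, and number of pairs at each separation (order <= 2)."""
    points = sorted(figure)
    gaps = Counter(q - p for p, q in combinations(points, 2))
    return len(points), gaps


def is_translate(x: Iterable[int], y: Iterable[int]) -> bool:
    """True if y is x shifted along the line by a single offset."""
    xs, ys = sorted(x), sorted(y)
    return len(xs) == len(ys) and len({b - a for a, b in zip(xs, ys)}) <= 1


X0 = {0, 1, 3}        # assumed example figure
OTHER = {0, 2, 3}     # its mirror image, not a translate of X0
```

Both figures have three points and one pair at each of the separations
1, 2 and 3, so any order-2 perceptron with translation-invariant
weights gives them the same sum, yet neither is a translate of the
other.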






With infinite groups we can use stratification for normalizing,
but then we must face the possibility of getting unbounded
coefficients within equivalent $\varphi$'s; and then the group-averaging
operations do not, in general, converge. This will be shown as a
theorem in §10.4.

We conjecture that predicates like the “twins” of §7.5 are not of 
finite order with bounded coefficients. In any case, it would be 
interesting to know whether there are such predicates. 

7.11 Problems 

A number of lines of investigation are intriguing: what is the
relation between the possible stratifications, including repeated
ones, and algebraic decompositions of the group into different
kinds of subgroups? For what kinds of predicates can the
group-invariance theorem be extended to infinite groups? What
predicates have bounded coefficients in each equivalence class, or in
each degree? Under what conditions do the “normal-form
stratifications” of application 7 exist? For example, we conjecture that
on circles or toroids, there is no bound on the order of predicates
$\psi$ that select unique “normal form” figures under rotation groups:

$\psi(X)$ and $\psi(gX) \Rightarrow X = gX$.

We suspect that this may be the reason we were unable to extend
the method of application 6 to the full Euclidean similarity group,
including rotation.

We note that the condition in Theorem 7.2 that the predicates
$\{\pi_j\}$ be masks is probably stronger than necessary. We have
not looked for a better theorem.

Stratified predicates probably are physically unrealizable because 
of huge coefficients. It would be valuable to have a form of 
Theorem 7.2 that could help establish lower bounds on the 
coefficients. 

A stratification seems to correspond to a serial machine that
operates sequentially upon the figure, with a sequence of group
transformation elements, until some special event occurs,
establishing its membership in $C_j$, and then applies a “matching test”
corresponding to $\psi_j$. The set of $\psi_j$'s must contain information
about the figures in all transformed positions, so the possibility
of a perceptron accomplishing such a recognition should not
suggest that the machine has any special generalization ability
with respect to the group in question; rather, it suggests the
opposite! The apparent enormity of the coefficient hierarchies casts
a gloomy shadow on the practicality of learning stratified
coefficients by reinforcement, since reinforcing a figure in $C_j$ cannot
work until it has depressed all discriminations in all preceding
classes. This is discussed further in Chapters 10 and 11.

[Handwritten example added by the authors; partly legible in this copy.]

Example: to recognize the translates of the pattern [drawn in the
original; not legible here].

Let $R$ be the half-line $x_1, x_2, x_3, \ldots$ and let $\psi$ be the
desired predicate: $\lceil X$ has exactly 3 black squares, in the pattern
shown$\rceil$. We will show that $\psi$ has order [illegible]. First we
define a special predicate $\psi_j$ for each instance of $\psi$ as follows:

$\psi_j = \lceil$[definition illegible]$\rceil = \lceil \Sigma_j > 0 \rceil$, where $\Sigma_j$ = [formula illegible].

Note that each $\psi_j$ has order 1. [Remark illegible.]

Now we can express $\psi(X)$ as something like

$\psi(X) = \lceil \psi_j(X)$, where $x_j$ is the rightmost black square of $X \rceil$,

but can we express the implicit selection of the correct $\psi_j$
within the perceptron framework? Yes, by using a trick!

Let $M_1, M_2, \ldots$ be a sequence of numbers that grows large very
rapidly: $M_1 = 10$, $M_2 = 10^{10}$, $\ldots$, $M_{i+1} = 10^{M_i}$. Then

$\psi = \lceil M_1 x_1 \Sigma_1 + M_2 x_2 \Sigma_2 + \cdots + M_j x_j \Sigma_j + \cdots > 0 \rceil$

because the term with the rightmost black $x_j$ will outweigh
all earlier terms and so determine the sign of
the whole sum. The entire chapter is based on variations
to exploit this simple bizarre concept. The
$x_j$ terms of this example correspond to the $\pi_j$ of the
text.



The Diameter-Limited Perceptron 

8 


8.0 

In this chapter we discuss the power and limitations of the
“diameter-limited” perceptrons: those in which each $\varphi$ can see only a
circumscribed portion of the retina $R$.

We consider a machine that sums the weighted evidence about a
picture obtained by experiments $\varphi_i$, each of which reports on the
state of affairs within a circumscribed region of diameter less than
or equal to some length $D$, that is, Diameter$(S(\varphi)) \le D$.

One gets two different theories when, in considering
diameter-limited predicate-schemes, one takes $D$ as

(1) an absolute length, or

(2) a fixed fraction of the size of $R$.

Generally, it is more interesting to choose (1) for positive results.
For negative results, (1) is usually a special case of an
order-limited theory, and (2) gives different and sometimes stronger
results. The theory did not seem deep enough to justify trying to
determine in each case the best possible result. From a practical
point of view one merely wants $D$ to be small enough that none of
the $\varphi$'s sees the whole figure (for otherwise we would have no
theory at all) and large enough to see interesting features.

8.1 Positive Results 

We will first consider some things that a diameter-limited percep- 
tron can recognize, and then some of the things it cannot. 

8.1.1 Uniform Picture 

A diameter-limited perceptron can tell when a picture is entirely
black, or entirely white: choose $\varphi$'s that cover the retina in regions
(that may overlap) and define $\varphi_i$ to be zero if and only if all the
points it can see are white. Then

$\sum \varphi_i > 0$

if the picture has one or more black points, and $\le 0$ if the picture
is blank. Similarly, we could define the $\varphi$'s to distinguish the
all-black picture from all others.
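A minimal sketch of this construction follows. It assumes a square
retina of side `n` covered by square windows of side `w`, which stand
in for regions of bounded diameter; the window layout and the function
names are illustrative choices, not part of the text.

```python
from typing import List, Set, Tuple

Square = Tuple[int, int]


def windows(n: int, w: int) -> List[Square]:
    """Top-left corners of w-by-w windows that together cover an n-by-n retina."""
    starts = sorted(set(list(range(0, n - w + 1, w)) + [n - w]))
    return [(r, c) for r in starts for c in starts]


def sees_black(figure: Set[Square], corner: Square, w: int) -> int:
    """phi_i: 0 if and only if every point in the window is white."""
    r0, c0 = corner
    return int(any(r0 <= r < r0 + w and c0 <= c < c0 + w for r, c in figure))


def sees_white(figure: Set[Square], corner: Square, w: int) -> int:
    """The dual local test: 0 if and only if every point in the window is black."""
    r0, c0 = corner
    return int(any((r, c) not in figure
                   for r in range(r0, r0 + w) for c in range(c0, c0 + w)))


def not_blank(figure: Set[Square], n: int, w: int) -> bool:
    """[sum of phi_i > 0]: true when the picture has at least one black point."""
    return sum(sees_black(figure, corner, w) for corner in windows(n, w)) > 0


def all_black(figure: Set[Square], n: int, w: int) -> bool:
    """[sum of dual tests < 1]: true only for the entirely black picture."""
    return sum(sees_white(figure, corner, w) for corner in windows(n, w)) < 1
```

Each local unit looks at only a `w`-by-`w` patch, yet a single unit
reporting 1 settles the question, which is the conjunctively local
character discussed next.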

These patterns are recognizable because of their “conjunctively
local” character (see §0.6): no $\varphi$-unit can really say that there is
strong evidence that the figure is all white (for there is only the
faintest correlation with this), but any $\varphi$ can definitely say that it
has conclusive evidence that the picture is not all white. Some
interesting patterns have this character, that one can reject all
pictures not in the class because each must have, somewhere or
other, a local feature that is definitive and can be detected by what
happens within a region of diameter $D$.

8.1.2 Area Cuts 

We can distinguish, for any number $S$, the class of figures whose
area is greater than $S$. To do this we define a $\varphi_p$ for each point $p$
to be 1 if $p$ is black, 0 otherwise. Then

$\sum \varphi_p > S$

is a recognizer for the class in question.

8.1.3 Triangles and Rectangles 

We can make a diameter-limited perceptron recognize the figures
consisting of exactly one triangle (either solid or outline) by the
following trick: We use two kinds of $\varphi$'s. The first kind, $\varphi_i$, has
value +1 if its field contains a vertex (two line segments meeting at
an angle), otherwise its value is zero. The second kind, $\varphi'_i$, has
value zero if its field is blank, or contains a line segment, solid
black area, or a vertex, but has value +1 if the field contains
anything else, including the end of a line segment. Provide enough of
these $\varphi$'s so that the entire retina is covered, in nonoverlapping
fashion, by both types. Of course, this won’t work when a vertex
occurs at the edge of a $\varphi$-support. By suitable overlapping, and
assignment of weights, the system can be improved, but it will always
be an approximation of some sort. This applies to the definition of
“line segment,” etc., as well as to that of “vertex.” See §8.3. Finally
assign weight 1 to the first type and a very large positive weight $W$
to those of the second type. Then

$\sum \varphi_i + W \sum \varphi'_i < 4$

will be a specific recognizer for triangles. (It will, however, accept
the blank picture, as well.) Similarly, by setting the $\varphi_i$'s to recognize
only right angles, we can discern the class of rectangles with

$\sum \varphi_i + W \sum \varphi'_i < 5$.

A few other geometric classes can be captured by such tricks, but
they depend on curious accidents. A rectangle is characterized by
having four right angles, and none of the exceptions detected by
the $\varphi'_i$'s. In §6.3.2 we did this for axis-parallel rectangles: for
others there are obviously more serious resolution and tolerance
problems. But there is no way to recognize the squares, even
axis-parallel, with diameter-limited $\varphi$'s; the method of §7.7 can’t be
so modified.

8.1.4 Absolute Template-matching 

Suppose that one wants the machine to recognize exactly a certain
figure $X_0$ and no other. Then the diameter-limited machine can be
made to do this by partitioning the retina into regions, and in
each region a $\varphi$ function has a value 0 if that part of the retina
is exactly matched to the corresponding part of $X_0$, otherwise the
value is 1. Then

$$\sum \varphi < 1$$

if and only if the picture is exactly $X_0$.

Note, however, that this scheme works just on a particular object 
in a particular position. It cannot be generalized to recognize a 
particular object in any position. In fact we show in the next 
section that even the simplest figure, that consists of just one 
point, cannot be recognized independently of position! 
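
The template scheme of §8.1.4 can be sketched in a few lines. This is only an illustration under stated assumptions: the retina is a small list of 0/1 rows, the regions are square blocks, and the names `make_template_recognizer`, `block`, `X0`, and `shifted` are hypothetical, not taken from the text.

```python
def make_template_recognizer(template, block=2):
    """Build the predicate [sum of phi < 1] for one fixed figure X_0."""
    h, w = len(template), len(template[0])
    # Partition the retina into block x block regions; each phi looks at one region.
    regions = [(r, c) for r in range(0, h, block) for c in range(0, w, block)]

    def phi(picture, r0, c0):
        # phi = 0 if this region matches X_0 exactly, otherwise 1.
        for r in range(r0, min(r0 + block, h)):
            for c in range(c0, min(c0 + block, w)):
                if picture[r][c] != template[r][c]:
                    return 1
        return 0

    def recognizer(picture):
        return sum(phi(picture, r, c) for r, c in regions) < 1

    return recognizer


X0 = [[0, 1, 0, 0],
      [1, 1, 1, 0],
      [0, 1, 0, 0],
      [0, 0, 0, 0]]
# The same cross moved one column to the right.
shifted = [[0, 0, 1, 0],
           [0, 1, 1, 1],
           [0, 0, 1, 0],
           [0, 0, 0, 0]]

accepts = make_template_recognizer(X0)
```

Here `accepts(X0)` holds while `accepts(shifted)` does not, which is the position dependence the paragraph above warns about.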

8.2 Negative Results 

8.2.1 The Figure Containing One Single Black Point 

This is the fundamental counterexample. We want a machine

$$\sum_i \alpha_i \varphi_i > \theta$$

to accept figures with area 1, but reject figures with area 0 or area
greater than 1. To see that this cannot be done with diameter-
limited perceptrons, suppose that $\{\varphi_i\}$, $\{\alpha_i\}$, and $\theta$ have been
selected. Present first the blank picture $X_0$. Then if
$f(X) = \sum_i \alpha_i \varphi_i(X)$, we have $f(X_0) \le \theta$. Now present a figure $X_1$
containing only one point $x_1$. We must then have

$$f(X_1) > \theta.$$

The change in the sum must be due to a change in the values of
some of the $\varphi$'s. In fact, it must be due to changes only in $\varphi$'s
for which $x_1 \in S(\varphi)$, since nothing else in the picture has changed.
In any case,

$$f(X_1) - f(X_0) > 0.$$

Now choose another point $x_2$ which is further than D away from
$x_1$. Then no $S(\varphi)$ can contain both $x_1$ and $x_2$. For the figure $X_2$
containing only $x_2$ we must also have

$$f(X_2) = \sum_i \alpha_i \varphi_i(X_2) > \theta.$$

Now consider the figure $X_{12}$ containing both $x_1$ and $x_2$. The addi-
tion of the point $x_1$ to $X_2$ can affect only $\varphi$'s for which $x_1 \in S(\varphi)$,
and these are changed exactly as they are changed when the all-
blank picture $X_0$ is changed to the picture $X_1$. Therefore

$$f(X_{12}) = f(X_2) + [f(X_1) - f(X_0)].$$

But then the two previous inequalities yield

$$f(X_{12}) > \theta,$$

which contradicts the requirement that

$$f(X_{12}) \le \theta.$$

Of course, this is the same phenomenon noted in §0.3 and §2.1.
And it gives the method for proof of the last statement in §8.1.3.
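
The additivity step in this argument can be checked numerically. The sketch below is an illustration only, under assumptions not in the text: a one-dimensional retina, supports that are runs of `diameter` consecutive points, arbitrary random Boolean $\varphi$'s, and integer weights. The function name `random_diameter_limited_sum` and all parameters are hypothetical.

```python
import random


def random_diameter_limited_sum(n_points, diameter, n_phis, seed=0):
    """Return f(X) = sum of alpha * phi(X) with every support of bounded diameter."""
    rng = random.Random(seed)
    phis = []
    for _ in range(n_phis):
        start = rng.randrange(n_points - diameter + 1)
        support = tuple(range(start, start + diameter))
        # An arbitrary Boolean function of the points in the support.
        table = [rng.randint(0, 1) for _ in range(2 ** diameter)]
        alpha = rng.randint(-5, 5)
        phis.append((support, table, alpha))

    def f(figure):
        # figure is the set of black points.
        total = 0
        for support, table, alpha in phis:
            index = sum(1 << k for k, p in enumerate(support) if p in figure)
            total += alpha * table[index]
        return total

    return f


f = random_diameter_limited_sum(n_points=20, diameter=4, n_phis=50)
x1, x2 = 3, 15                      # farther apart than any support is wide
lhs = f({x1, x2})                   # f(X_12)
rhs = f({x2}) + (f({x1}) - f(set()))  # f(X_2) + [f(X_1) - f(X_0)]
```

Whatever the random choices, `lhs` equals `rhs`, because no single support sees both points; the contradiction in the text then follows from the two inequalities on $f(X_1)$ and $f(X_2)$.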

8.2.2 Area Segments 

The diameter-limited perceptron cannot recognize the class of
figures whose areas A lie between two bounds $A_1 \le A \le A_2$.

proof: this follows from the method of §8.2.1, which is a special
case of this, with $A_1 = 1$ and $A_2 = 1$. We recall that this recogni-
tion is possible with order 2 if the diameter limitation is relaxed
using the method of §1.4, example 7.

8.2.3 Connectedness 

The diameter-limited perceptron cannot decide when the picture 
is a single, connected whole, as distinguished from two or more 





disconnected pieces. At this point the reader will have no dif- 
ficulty in seeing the formal correctness of the proof we gave of 
this in §0.8. 

8.3 Diameter-limited Integral Invariants 

We observed in §6.3.1 that convexity has order 3, but that the construc- 
tion used there would not carry over to the diameter-limited case, because 
it would not reject a figure with two widely separated convex com- 
ponents. On the other hand, §8.1.3 shows how a diameter-limited predi- 
cate can capture some particular convex figures. The latter construction 
generalizes, but leads into serious problems about tolerance and into 
questions about differentials. 

Suppose that we define a diameter-limited family of predicates $\Phi_\epsilon$ using
the following idea: Choose an $\epsilon > 0$. Cover R with a partition of small
cells $C_j$. For each integer k define $\varphi_{jk}$ to be 1 if $C_j \cap X$ contains an
“edge” with change-in-direction greater than $k\epsilon$ and otherwise $\varphi_{jk} = 0$.

Now consider the “integral”

$$\sum_{jk} \epsilon\, \varphi_{jk}.$$

The contribution to the sum of each segment of curve will be
$\epsilon \cdot c/\epsilon = c$, where c is the magnitude of the change in direction of the segment;
hence the total sum is the “total curvature.” Finally we claim that we can
realize $\psi_{\mathrm{CONVEX}}$ as

$$\sum_{jk} \epsilon\, \varphi_{jk} \le 2\pi,$$

because the total curvature of any figure must be $\ge 2\pi$ and only (and all)
convex figures achieve the equality. We ignore figures that reach the edge
of the retina and such matters.
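
A continuous analogue of the total-curvature test can be tried on polygons. This is a sketch under assumptions of its own: the figure is a simple closed polygon given by its vertices, the turning at each vertex is measured exactly rather than in steps of $\epsilon$, and a small tolerance absorbs rounding error. The names `total_curvature`, `looks_convex`, and `tol` are hypothetical.

```python
import math


def total_curvature(vertices):
    """Sum of the absolute turning angles at the vertices of a closed polygon."""
    n = len(vertices)
    total = 0.0
    for k in range(n):
        ax, ay = vertices[k - 1]
        bx, by = vertices[k]
        cx, cy = vertices[(k + 1) % n]
        ux, uy = bx - ax, by - ay          # incoming edge
        vx, vy = cx - bx, cy - by          # outgoing edge
        # Signed turning angle from u to v, then its magnitude.
        total += abs(math.atan2(ux * vy - uy * vx, ux * vx + uy * vy))
    return total


def looks_convex(vertices, tol=1e-9):
    """Total curvature is at least 2*pi; convex polygons achieve equality."""
    return total_curvature(vertices) <= 2 * math.pi + tol


square = [(0, 0), (1, 0), (1, 1), (0, 1)]
l_shape = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
```

The square turns through four right angles and passes; the L-shape turns through six (five one way, one the other) and fails. The quantized sum in the text approximates this quantity, which is where the tolerance questions raised there come from.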

A similar argument can be used to construct a predicate that uses the
signed curvature to realize functions of the Euler characteristic of the
form $G(X) < n$, since that invariant is just the total signed curvature
divided by $2\pi$. Of course on the quantized plane the diameter-limited
predicate of §5.8.1 does this more simply.

One could go on to describe more sophisticated predicates that classify 
figures by properties of their “differential spectra.” 

However, we do not pursue this because these observations already raise
a number of serious questions about tolerances and approximations.
There are problems about the uniformity of the coverings, the sizes of $\epsilon$
and the diameter-limited cells $C_j$, and problems about the cumulative
errors in summing small approximate quantities. Certainly within the
$E^2 \to R$ square map described in Chapter 5, or anything like it, all
such predicates will give peculiar results whenever the diameter cells are
not large compared to the underlying mesh, or small compared to the
relevant features of the X's. The analysis, in §9.3, of $\psi_{\mathrm{CONVEX}}$ attempts
to face this problem.

For example, we can regard the recognition of rectangles, as done in 
§6.3.2, as a pure artifact in this context, because it so depends on the 
mesh. The description in §8.1.3 of another form of the same predicate 
is worded in such a way that one could make reasonable approximations, 
within reasonable size ranges. 

8.4 Proof of Uniqueness of the Eulerian Invariants for 
Diameter-Limited Perceptrons 

In this section we show, as promised at the end of Chapter 5, that 

Theorem 8.4: Diameter-limited perceptrons cannot recognize any
nontrivial topological properties except the Eulerian predicates

$$\lceil E(X) > n \rceil \quad \text{and} \quad \lceil E(X) < n \rceil.$$

proof: The argument of §5.8 shows that $\psi(X)$ must be a function
of $E(X)$. This is immediate for the absolute diameter-limit, which
is a special case of order-limit. The argument, with suitable modi-
fications, carries over to relative diameter-limits. Now consider
two figures A and B that differ only in a single interior square:




[Figure: the two figures A and B, with a circle marking the range of the
diameter limit. A handwritten note by the authors on “differential order”
is largely illegible in this copy.]

where the circle shows the range of the diameter limit. Then
suppose that $\psi(X) = \lceil \sum \alpha_\varphi \varphi(X) > \theta \rceil$ and consider the difference

$$\Delta = \sum \alpha_\varphi \varphi(B) - \sum \alpha_\varphi \varphi(A).$$

Now if it happens that $\Delta > 0$ then

$$\psi(B) \ge \psi(A),$$

hence removing a hole cannot decrease $\psi$. By topological equiva-
lence, adding a component has the same effect upon $E(X)$ and
hence upon $\psi(X)$. Thus, if $\Delta > 0$, then

$$E(B) > E(A) \implies \psi(B) \ge \psi(A),$$

and similarly, if $\Delta < 0$ then

$$E(B) > E(A) \implies \psi(B) \le \psi(A).$$

It follows that in each case there must be an n such that (if
$\Delta > 0$)

$$\psi(X) = \lceil E(X) > n \rceil$$

or (if $\Delta < 0$)

$$\psi(X) = \lceil E(X) < n \rceil$$

or else $\psi$ is a constant.

The trivial exceptions are the constant predicates and the uni- 
form predicates of §8.1.1, which are exceptions to the canonical 
form of §5.8. 
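
That the Eulerian predicates themselves are within reach of local tests can be illustrated on a quantized retina. The sketch below is an assumption-laden illustration, not the construction of §5.8.1: it treats each black cell as a closed unit square (so cells touching at a corner count as joined) and computes the Euler number as vertices minus edges plus faces, every term of which depends on at most a 2 × 2 neighborhood. The names `euler_number` and `euler_predicate` are hypothetical.

```python
def euler_number(grid):
    """V - E + F for the union of closed unit squares at the 1-cells of grid."""
    faces = 0
    edges = set()
    vertices = set()
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if not value:
                continue
            faces += 1
            # The four corners of the square at (r, c).
            vertices.update({(r, c), (r, c + 1), (r + 1, c), (r + 1, c + 1)})
            # Its four sides: "h" runs right from a corner, "v" runs down.
            edges.update({(r, c, "h"), (r + 1, c, "h"),
                          (r, c, "v"), (r, c + 1, "v")})
    return len(vertices) - len(edges) + faces


def euler_predicate(grid, n):
    """The Eulerian predicate [E(X) > n]."""
    return euler_number(grid) > n


single = [[1]]
ring = [[1, 1, 1],
        [1, 0, 1],
        [1, 1, 1]]
two_pieces = [[1, 0, 1]]
```

With these conventions `single` has Euler number 1, `ring` has 0 (one component, one hole), and `two_pieces` has 2.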

[Handwritten note by the authors on the differential-order idea, largely
illegible in this copy.]



Geometric Predicates and Serial Algorithms 

9 


9.0 Connectedness and Serial Computation 

It seems intuitively clear that the abstract quality of connected- 
ness cannot be captured by a perceptron of finite order because of 
its inherently serial character: one cannot conclude that a figure is 
connected by any simple order-independent combination of 
simple tests. The same is true for the much simpler property of 
parity. In the case of parity, there is a stark contrast between 
our “worst possible” result for finite-order machines (§3.1, §10.1) 
and the following “best possible” result for the serial computation
of parity. Let $x_1, x_2, \ldots, x_n$ be any enumeration of the points
of R and consider the following algorithm for determining the
parity of |X|:

start: Set i to 0.

even:  Add 1 to i.
       If i = |R| then stop; $\psi_{\mathrm{PARITY}} = 0$.
       If $x_i = 0$, go to even; otherwise go to odd.

odd:   Add 1 to i.
       If i = |R| then stop; $\psi_{\mathrm{PARITY}} = 1$.
       If $x_i = 0$, go to odd; otherwise go to even.


Now this program is “minimal” in two respects: first in the num- 
ber of computation-steps per point, but more significant, in the 
fact that the program requires no temporary storage place for 
partial information accumulated during the computation, other 
than that required for the enumeration variable i. [In a sense, the
process requires one binary digit of current information, but this 
can be absorbed (as above) into the algorithm structure.] 
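
The two-label program can be mirrored directly in code. In this sketch the labels even and odd become the value of a single variable `state`, and the loop is assumed to stop only after the last point has been examined; the name `serial_parity` and the list representation of the retina are hypothetical.

```python
def serial_parity(x):
    """Return psi_PARITY for a sequence x of 0/1 point values."""
    i = 0                  # the enumeration variable
    state = "even"         # which label the program is currently at
    while i < len(x):
        if x[i] == 1:
            # A black point moves control to the other label.
            state = "odd" if state == "even" else "even"
        i += 1
    return 1 if state == "odd" else 0
```

Apart from `i`, the only information carried along is which of the two labels is active, which is the one binary digit mentioned above.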

This suggests that it might be illuminating to ask for connected- 
ness: how much memory is required by the best serial algorithm? 
The answer, as shown below, is that it requires no more than 
about 2 times that for storing the enumeration variable alone! To 
study this problem it seems that the Turing-machine framework is 
the simplest and most natural because of its uniform way of 
handling information storage. 

9.1 A Serial Algorithm for Connectedness 

Connectedness of a geometric figure X is characterized by the fact
that between any pair (p, q) of points of X there is a path that lies
entirely in X. An equivalent definition, using any enumeration
$x_1, \ldots, x_{|R|}$ of the points of R is: X is connected when each
point $x_i$ in X, except the first point in X, has a path to some $x_j$ in
X for which i > j. (Proof: By recursion, then, each point of X is
connected to the first point in X.) Using this definition of con-
nectedness we can describe a beautiful algorithm to test whether
X is connected. We will consider only figures that are “reasonably
regular”; to be precise, we suppose that for each point $x_i$ on
a boundary there is defined a unique “next point” $x_{i^*}$ on that
boundary. We choose $x_{i^*}$ to be the boundary point to one’s right
when standing on $x_i$ and facing the complement of X. We will also
assume that points $x_i$ and $x_{i+1}$ that are consecutive in the enumera-
tion are adjacent except at the edges of R. Finally, we will assume
that X does not touch the edges of the space R.

[Handwritten addition by the authors, partly illegible in this copy:
“Assume also that X never becomes just one square ... as mistakenly
shown in the next ...”]

start:  Set i to 0 and go to search.

search: Add 1 to i. If i = |R|, stop and print “X is NULL.”
        If $x_i \in X$ then go to scan, otherwise go to search.

scan:   Add 1 to i. If i = |R|, stop and print “X is connected.”
        If $x_{i-1} \notin X$ and $x_i \in X$ then set j to i and go to trace,
        otherwise go to scan.

trace:  Set j to j*.
        If j = i, stop and print “X is disconnected.”
        If j > i, go to trace.
        If j < i, go to scan.

Notice that at any point in the computation, it is sufficient to
keep track of the two integers i and j; we will see that no extra
memory space is needed for |R|.

Analysis: search simply finds the first point of X in the enumera-
tion of R. Once such a point of X is found, scan searches through
all of R, eventually testing every point of X. The current point,
$x_i$, of scan is tested as follows: If $x_i$ is not in X, then no test is
necessary and scan goes on to $x_{i+1}$. If the previous point $x_{i-1}$ was
in X (and, by induction, is presumed to have passed the test) then
$x_i$, if in X, is connected to $x_{i-1}$ by adjacency. Finally, if $x_i \in X$ and


before, or (2) B is an interior boundary curve, in which case a
point of B must have been encountered before reaching $x_i$, which
is inside B, or (3) B is the exterior boundary curve of a never-
before-encountered component of X, the only case in which
trace will return to $x_i$ without meeting an $x_j$ for which j < i.
Thus scan will run up to i = |R| if and only if X has a single
nonempty connected component (see Figure 9.1).
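
A runnable variant of the scan-and-trace idea is sketched below. It is not the authors' program: it assumes a row-major enumeration, 4-adjacency for X, and it walks along the sides of cells rather than along boundary points, so that the "next" step is always unique. It tests the left neighbor of $x_i$ in place of $x_{i-1}$ and treats everything off the array as white. The names `is_connected`, `in_x`, and `trace_finds_earlier` are hypothetical.

```python
DIRS = [(-1, 0), (0, 1), (1, 0), (0, -1)]   # N, E, S, W in clockwise order


def is_connected(grid):
    """None for the empty figure, otherwise True or False."""
    h, w = len(grid), len(grid[0])

    def in_x(r, c):
        return 0 <= r < h and 0 <= c < w and grid[r][c] == 1

    def trace_finds_earlier(i):
        # State: a black cell (r, c) and the side d on which a white cell lies.
        r, c, d = i // w, i % w, 3           # start on the west side of x_i
        start = (r, c, d)
        while True:
            t = (d + 1) % 4                  # direction of travel along the wall
            qr, qc = r + DIRS[t][0], c + DIRS[t][1]
            if not in_x(qr, qc):
                d = t                        # wall turns around the same cell
            else:
                rr, rc = qr + DIRS[d][0], qc + DIRS[d][1]
                if in_x(rr, rc):
                    r, c, d = rr, rc, (d + 3) % 4   # wall turns the other way
                else:
                    r, c = qr, qc            # wall continues straight
            if r * w + c < i:
                return True                  # met a point x_j with j < i
            if (r, c, d) == start:
                return False                 # came back: a new component

    seen_first = False
    for i in range(h * w):                   # the scan
        r, c = divmod(i, w)
        if not in_x(r, c):
            continue
        if not seen_first:
            seen_first = True                # the first point of X is exempt
            continue
        if not in_x(r, c - 1) and not trace_finds_earlier(i):
            return False
    return True if seen_first else None
```

At any moment the procedure holds only the scan index `i` and the trace position `(r, c, d)`, that is, two point indices and two extra bits, in the spirit of the remark that i and j suffice.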

9.2 The Turing-Machine Version of the Connectedness Algorithm 

It is convenient to assume that R is a $2^n \times 2^n$ square array. Let
$x_1, \ldots, x_{|R|}$ be an enumeration of the points of R in the order

$$\begin{array}{llll}
1, & 2^n + 1, & \ldots, & (2^n - 1)2^n + 1, \\
2, & 2^n + 2, & \ldots, & (2^n - 1)2^n + 2, \\
\vdots & \vdots & & \vdots \\
2^n, & 2^n + 2^n, & \ldots, & (2^n - 1)2^n + 2^n.
\end{array}$$

This choice of dimension and enumeration makes available a
simple way to represent the situation to a Turing machine. The
Turing machine must be able to specify a point $x_i$ of R, find
whether $x_i \in X$, and in case $x_i$ is a boundary point of X, find the
index $i^*$ of the “right neighbor” of $x_i$. The Turing-machine tape
will have the form

[Figure: the tape, divided by the punctuation symbols $I_x$, $I_y$, $J_x$, $J_y$
into intervals “. .n . .”]

where “. .n . .” denotes an interval of n blank squares. Then the
intervals to the right of $I_x$ and $I_y$ can hold the x and y coordinates
of a point of R.

We will suppose that the Turing machine is coupled with the out-
side world, that is, the figure X, through an “oracle” that works
as follows: certain internal states of the machine have the
property that when entered, the resulting next state depends on
whether the coordinates in the I (or J) intervals designate a point
in X. It can be verified, though the details are tedious, that all
the operations described in the algorithm can be performed by a
fixed Turing machine that uses no tape squares other than those
in the intervals. For example, “i = |R|” if and only if
there are all zeros in the “. .n . .”’s following $I_x$ and $I_y$. “Add 1 to
i” is equivalent to “start at $J_y$ and move left, changing 1’s to 0’s
until a 0 is encountered (and changed to 1) or until $I_y$ is met.” The
only nontrivial operation is computing j* given j. But this re-
quires only examining the neighbors of $x_j$, and that is done by
adding ±1 to the $J_x$ and $J_y$ coordinates, and consulting the oracle.
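
The "add 1" operation on the tape is ordinary binary incrementing carried out from the right. A minimal sketch, assuming the counter is held as a list of bits with the most significant bit first and that running off the left end plays the role of meeting the punctuation symbol; the name `add_one` is hypothetical.

```python
def add_one(bits):
    """Increment the counter in place; return True if it wrapped to all zeros."""
    k = len(bits) - 1          # start at the right end and move left
    while k >= 0 and bits[k] == 1:
        bits[k] = 0            # change 1's to 0's ...
        k -= 1
    if k >= 0:
        bits[k] = 1            # ... until a 0 is met and changed to 1
        return False
    return True                # reached the left marker: the counter is all zeros


counter = [0, 1, 1]
wrapped = add_one(counter)     # counter is now [1, 0, 0]
```

The all-zeros outcome is the condition the text uses to detect that the enumeration has run through the whole of R.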

Since the Turing machine can keep track of which “. .n . .” interval 
it is in, we really need only one symbol for punctuation, so the 
Turing machine can be a 3-symbol machine. By using a block en- 
coding, one can use a 2-symbol machine, and, omitting details, we 
obtain the result: 

Theorem 9.2: For any $\epsilon$ there is a 2-symbol Turing machine that
can verify the connectedness of a figure X on any rectangular
array R, using less than $(2 + \epsilon) \log_2 |R|$ squares of tape!

We are fairly sure that the connectedness algorithm is minimal in
its use of tape, but we have no proof. (In fact, we are very weak in
methods to show that an algorithm is minimal in storage; this is
discussed in Chapter 12.) Incidentally, it is not hard to show that
$\lceil |X| \text{ is prime} \rceil$ requires no more than $(2 + \epsilon) \log_2 |R|$ squares
(and presumably needs more than $(2 - \epsilon) \log_2 |R|$).

We have little definite knowledge about geometric predicates that 
require higher orders of storage, but we suspect that, in an ap- 
propriate sense, recognizing the topological equivalence of two 
figures (for example, two components of X) requires something 
more like |R| than like log |R| squares. There are, of course,
recursive function-theoretic predicates that require arbitrarily 
large amounts of storage, but none of these are known to have 
straightforward geometric interpretations. 

9.2.1 Pebble Automata 

A variant of this computation model has been studied by M. 
Blum and C. Hewitt. The Turing machine is replaced by a finite-state
automaton which moves about on the retina, reading the
color of the cell on which it is currently located. As a function 
of this input and its current state, the automaton determines its 
next state and one of four possible moves: north, east, south, 
west. A properly designed automaton should operate on any 





retina, however large, provided that it is given a way to detect 
the edge of the array. This is a convenient way to realize the idea 
of a predicate-scheme. 

The position of the automaton on the retina plays the role of one
of the two point indices I and J remembered by the Turing
machine. To give the machine the effect of the second point index,
it can be provided with a pebble that can be left anywhere on the
retina and retrieved later. We leave to readers the extremely
tricky exercise of translating the Turing machine algorithm into
a form suitable for an automaton with one pebble. Can con-
nectedness be recognized without using the pebble? Surely not,
but we have not proved it!

9.3 Memory-Tape Requirements for $\psi_{\mathrm{CONVEX}}$

For convexity we can also get a bound on the tape memory. 
However, since convexity is a metric property, one must face the 
problem of precision of measurement vis-a-vis the resolution of 
the finite lattice of R. It seems reasonable to ask that the figure 
have no indentations larger than the order of the size of a 
lattice square. One way to verify this is to check, for each pair 
(a, b) of boundary points, that there is no such indentation: 


[Figure: an indentation in the boundary between the points a and b.]



To make this test seems to require the equivalent of scanning the 
squares close to the ideal line from a to b, and some memory is 
required to stay sufficiently close to its slope. For each increment 




in (say) y one must compute for x the largest integer in 


and the remainder, with its $\log_2 n$ digits, must be saved for the
iterative computation. Thus the computation can be done if one
stores $\log_2 n$ digits for each of a, b, x, y, and r, where

$$r(y) = \text{the remainder of } \frac{(b - a)\,y}{n},$$

which can be obtained from a register containing x and r by
adding b − a at each step:

[Figure: a register holding x and r; overflow from r carries into x.]

Thus one can test for convexity by using the order of $\tfrac{5}{2} \log_2 |R|$
squares. There is an evident redundancy here since (for example)
a can be reconstructed from the other four quantities, and this
suggests that with some ingenuity one could get by with just
$(2 + \epsilon) \log_2 |R|$.
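
The register trick can be sketched as follows. The assumptions are mine, not the text's: b ≥ a, the quantity tracked is the integer part and remainder of (b − a)·y / n, and the hypothetical function `trace_line` simply yields the register contents at each step.

```python
def trace_line(a, b, n):
    """Yield (y, x, r) with (b - a) * y == x * n + r and 0 <= r < n."""
    x, r = 0, 0
    for y in range(n + 1):
        yield y, x, r
        r += b - a             # add b - a at each step
        while r >= n:          # overflow of the remainder carries into x
            r -= n
            x += 1


steps = list(trace_line(2, 9, 16))
```

Only additions and comparisons of numbers below 2n are needed, so each stored quantity fits in about $\log_2 n$ digits, which is half of $\log_2 |R|$ on an n × n retina; five such quantities give the $\tfrac{5}{2} \log_2 |R|$ figure.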

In any case we have no idea of how to establish a lower bound. 
Although convexity is simpler than connectivity in that it is 
conjunctively local , this is no particular advantage to the Turing 
machine, which is well suited for recursive computation, and this 
simplicity is quite possibly balanced by the complication of the 
metric calculation. So we are inclined to conjecture that both 
$\psi_{\mathrm{CONVEX}}$ and $\psi_{\mathrm{CONNECTED}}$ require about $2 \log_2 |R|$ tape squares for
realization by Turing machines. We regard our inability to prove 
a firm lower bound as another symptom of the general weakness 
of contemporary computation theory’s tools for establishing min- 
imal computational complexity measures on particular algo- 
rithms. 

9.4 Connectedness and Parallel Machinery 

We have seen that there exists a Turing machine that can compute
$\psi_{\mathrm{CONNECTED}}$ with very little auxiliary storage in the form of memory
tape. The computation requires an appreciable amount of time,
or number of steps of machine operation. The number of Turing-
machine steps appears to be of the order of |R| log |R| for
reasonably regular figures (for “bad” figures there may be a term
of order $|R|^2 \log |R|$). On the other side, the Turing machine
requires a remarkably small amount of physical machinery, which
is used over and over again in the course of the computation.

If one has more machinery, one should be able to reduce the
number of time steps required for a computation, but we know
very little about the nature of such exchanges. In the case of
realizing $\psi_{\mathrm{CONNECTED}}$, one can gain time by subdividing the space
into regions and computing, simultaneously, properties of the
connectivity within the regions. For example, suppose that we
had machines capable of establishing, in less than the time neces-
sary to compute $\psi_{\mathrm{CONNECTED}}$ for the whole retina, a “connection
matrix” for boundary points on each quadrant. In the figure, this
means knowing that a is connected to a′, b to b′, and so on.



The connectedness of the whole can then be decided by an algo- 
rithm that “sews” together these edges. 

If the mesh is made finer, the computations within the cells be-
come faster, but the “sewing” becomes more elaborate. On the
other hand, the subdivision can probably be applied recursively
to the sewing operation also, and we have not studied the possible
exchanges. We can find an interesting upper bound for an extreme
case: Suppose the machine is composed entirely of Boolean func-
tions of two arguments; then how much time is required for such
a machine to compute $\psi_{\mathrm{CONNECTED}}$, assuming that each Boolean
operation requires one time unit?

Suppose, for convenience, that R has \R\ = 2 n squares (points). 
Certain pairs of points are considered to be “adjacent,” and by 
chaining this relation we can describe ^connected by a particularly 
compact inductive definition; we write 

C\j{X) = \x t A Xj A (x/is adjacent to jc 7 -)1 

and 

1*1 

CTj + \X) = V CT k (X) A CTj(X). (1) 

k= 1 

Each point x,- is considered to be connected to itself, so that 
Cu(X) = \x t e X]. Then it can be seen inductively that C™(X) 
is true if and only if x,- and x y are connected by a chain of <2 m of 
adjacent points, all in X. The whole figure is connected, that is, 
^connected W = 1, if Clj(X) = 1 for every pair for which x, e X 
and Xj € X. Hence 

CONNECTED = f X i A Xj >C jj(X) 1 

1*1 1*1 

= A A [jc, Vxj VC7j(X)]. (2) 

/=1 j=\ 

This function can be composed in a machine with a separate 
layer for each level of CTj . To connect C™ +l to the appropriate 
CT/s requires bringing together up to |/?| terms, using Equation 
1, and this requires a tree of or\ of at most n = log 2 \R\ layers 




Geometric Predicates and Serial Algorithms 9.4 [145] 


in depth. There are n such layers so the total time to compute 
C"j is of the order of n 2 . Using Relation 2, we find the final com- 
bination requires about 2 n layers so we have 

time (\^ CONNECTED ) < (log|i?|) + k • log |/?|, 

where k is a small constant.* 
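
The doubling construction can be simulated directly, although a serial simulation of course does not show the parallel time bound. The sketch assumes 4-adjacency on a small grid, starts from the matrix of adjacent-or-equal black pairs, and squares it until chains as long as the whole retina are covered. The name `connected_by_squaring` is hypothetical.

```python
def connected_by_squaring(grid):
    """Decide connectedness by repeated Boolean squaring of a connection matrix."""
    cells = [(r, c) for r, row in enumerate(grid) for c, _ in enumerate(row)]
    black = {(r, c) for r, row in enumerate(grid)
             for c, v in enumerate(row) if v}
    size = len(cells)

    def adjacent(p, q):
        # 4-adjacency; a point also counts as adjacent to itself.
        return abs(p[0] - q[0]) + abs(p[1] - q[1]) <= 1

    # Base matrix: both points black and adjacent (or equal).
    C = [[p in black and q in black and adjacent(p, q) for q in cells]
         for p in cells]

    layers = 0
    while (1 << layers) < size:            # about log2 |R| doublings
        C = [[any(C[i][k] and C[k][j] for k in range(size))
              for j in range(size)]
             for i in range(size)]         # Equation (1): or over all k
        layers += 1

    # Relation (2): every pair of black points must be linked.
    return all(C[i][j]
               for i, p in enumerate(cells) if p in black
               for j, q in enumerate(cells) if q in black)
```

Each pass of the loop corresponds to one layer of the machine, and the `any(...)` over k is the |R|-way or whose tree depth accounts for the second factor of log |R|. For the empty figure the final conjunction is vacuous and the function returns True.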

We doubt that the computation can be done in much less than the
order of $(\log |R|)^2$ units of time, with any arrangement of plausible
computer ingredients arranged in any manner whatever. Notice that
we were careful to count the delay entailed by the or opera-
tions. If this is neglected the computation requires only log |R|
steps, but this is physically unrealistic for large |R|. Indeed,
we really should prohibit the unlimited “branching” or copying
of the outputs of the elements; if the amplifiers that are physically
necessary for this are counted we have to replace our estimate by
$3(\log |R|)^2$. As usual, we have no firm method to establish the
lower bound. However, the following pseudoproof seems rele-
vant:

1. Using more “memory” in the machine doesn’t seem to help.
Can the machine be speeded up by storing a library of connected
figures and identifying them rather than working out the defini-
tion of connectivity each time? The extreme: build a library of
all connected figures on R. A tree of binary Boolean operators
can be built to match any pattern in just log |R| time steps. This
greatly speeds up the analogue of part 1 above. But there are
so many different connected figures that one has now to or
together of the order of $2^{\theta |R|}$ terms (where $\theta$ is some fraction
below 1) so the analogue of part 2 takes $\log(2^{\theta |R|}) = \theta |R|$
steps, which is worse than $(\log |R|)^2$ for large R. Of course this
is not a proof, but we think it is an indication.

2. Using loops cannot increase speed. The $(\log |R|)^2$ machine is a
loop-free hierarchy of Boolean functions; it has no “serial”
computation capacity except that which lies in its layered struc-
ture.

One could vastly reduce its number of parts (of which there are
the order of $|R|^3 \cdot \log |R|$) by making a network with loops: in-
deed we could build a Turing machine that would have only
k log |R| parts, for some modest k. But, for a given computation
of bounded length, the fastest machine with loops cannot be
faster than the fastest loop-free machine (ignoring branching
costs). For one can always construct an equivalent loop-free
machine by making copies of the original machine, one copy
for each computation step, with all functions taking arguments
from earlier copies.

*This construction was suggested to us by R. Floyd and A. Meyer.




3. The connection-matrix scheme seems hard to improve. There
exist figures with nonintersecting paths of length the order of |R|.
It seems clear that any recognition procedure using two-argument
functions requires at least log |R| steps, because one cannot do
better than double the path length at each step, as does our $C_{ij}$
connection-matrix method. At each such step there must be of the
order of |R| alternative connections that must be or-ed together.
Perhaps the proof could be completed if one could show that
nothing can be gained by postponing these or's, so that each re-
quires log |R| logic levels.


9.5 Connectedness in Iterative Arrays 

Terry Beyer has investigated the time necessary to compute
$\psi_{\mathrm{CONNECTED}}$ in a situation that provides a different and perhaps
more natural model for parallel geometric procedures. Suppose
that each square of a retina contains an automaton able to com-
municate only with its four neighbors. It can also tell the state
(black or white) of its square. The final decision about whether
the figure is connected or not is to be made by some fixed automa-
ton, say the one in the top left-hand corner. On the assumption
that the states change only at fixed intervals of time, we ask how
many time units must pass before the decision can be made. It is
obvious that on an n × n retina this will take at least 2n time
units, for this is the time required for any information to pass
from the bottom right corner to the top left. It is not difficult
to design arrays of automata that will make the decision in the
order of $n^2$ (that is, |R|) time units. Beyer’s remarkable result
is that $(2 + \epsilon)n$ is sufficient, where $\epsilon$ can be made as small as
one likes by allowing the automaton to have sufficiently many
states.

[Handwritten note by the authors: See Wendell T. Beyer, “Recognition of
Topological Invariants by Iterative Arrays,” Ph.D. dissertation, M.I.T., 1969.]


Thus the order of magnitude of time taken by the array is propor-
tional to $\sqrt{|R|}$, which is (naturally) intermediate between the
times taken by the single serial machine (|R|) and the unrestricted
parallel machine which is known to take $\le (\log |R|)^2$.

The following gives an intuitive picture of Beyer’s (unpublished) 
algorithmic process. The overall effect is that of enclosing a com- 
ponent in a triangle as shown below, and slowly compressing it 
into the northwest corner by moving the hypotenuse inward. 



Each component is compressed to one isolated point before 
vanishing. Whenever this event takes place it can be recognized 
locally and the information is transmitted through the network 
to the corner. Thus the connectedness decision is made positively 
or negatively depending on whether such an event happens once 
or more than once. More precisely, the compression process starts 
by finding the set of all “southeast corners” of the figure. 



The center square is a SE corner if the South 
and East are empty. All other squares shown 
may be empty or full. 






In the compression operation, each SE corner is removed, while
inserting a new square when necessary to preserve connectedness
as shown in the next figure:

[Figure: a figure X and its image T(X); a square is inserted where simply
removing the SE corner would break the connection.]

The diagonal lines show how repetition of this local process does
squeeze the figure to the northwest.

Repeated applications of T eventually reduce each component to 
a single point. The next figure shows how it (narrowly but effec- 
tively) avoids merging two components. 



It is easy to see that a component within a hole will vanish (and 
be counted) just in time to allow the surrounding component to 
collapse down. We do not know any equivalent process for three 
dimensions. (Consider knots!) 



LEARNING THEORY 






Introduction to Part III 

Our final chapters explore several themes which have come, in 
cybernetic jargon, to be grouped under the heading “learning.” 
Up to now we discussed the timeless existence of linear represen- 
tations. We now ask how to compute them, how long this takes, 
how big they are, and how efficient they are as a means of storing 
information. A proof, in Chapter 10, that coefficients can grow 
much faster than exponentially with |R| has serious consequences
both practically and conceptually: the use of more memory ca- 
pacity to store the coefficients than to list all the figures strains 
the idea that the machine is making some kind of abstraction. 

Chapter 11 clarifies the remarkable perceptron convergence
theorem by relating it to familiar phenomena associated with 
finite-state machines, with optimization theory and with feedback 
as a computational device. 

Chapter 12 abandons the strict definition of the perceptron to 
study a larger family of algorithms based on local partial predi- 
cates. These include methods (like Bayesian decisions) used by 
statisticians, as well as ideas (like hash coding) known only to 
programmers. Its aim is to indicate an area of computer science 
encompassing these apparently different processes. We dramatize 
the need for such a theory by singling out a simply stated un- 
solved problem about the more direct and commonly advocated 
methods for the storage and retrieval of information. 



Magnitude of the Coefficients 

10 


10.1 Coefficients of the Parity Predicate 

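In §3.1 we discussed the predicate
$\psi_{\mathrm{PARITY}}(X) = \lceil |X| \text{ is an odd number} \rceil$
and showed that if $\Phi$ is the set of masks then all the
masks must appear in any $L(\Phi)$ expression for $\psi_{\mathrm{PARITY}}$. One such
expression is

$$\psi_{\mathrm{PARITY}}(X) = \Bigl\lceil \sum (-2)^{|S(\varphi)|} \varphi(X) < -1 \Bigr\rceil$$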

which contains all masks of $\Phi$ with coefficients that grow ex-
ponentially with the support-size of the masks. We will now show
that the coefficients must necessarily grow at this rate, because
the sign-changing character of parity requires that each coefficient
be large enough to drown out the effects of the many coefficients
of its submasks. In effect, we show that $\psi_{\mathrm{PARITY}}$ can be realized
over the masks only by a stratificationlike technique! So suppose
that we have $\psi_{\mathrm{PARITY}} = \lceil \sum \alpha_i \varphi_i > 0 \rceil$. Suppose also that the group-
invariance theorem has been applied to make equal all $\alpha$'s for all
$\varphi$'s of the same support size, and suppose finally that the dis-
crimination of $\psi_{\mathrm{PARITY}}$ is “reliable,” for example, that $\sum \alpha_i \varphi_i \ge 1$
for odd parity and $\sum \alpha_i \varphi_i \le 0$ for even parity. Then we obtain
the inequalities

$$\begin{aligned}
\alpha_1 &\ge 1, \\
\alpha_2 + 2\alpha_1 &\le 0, \\
\alpha_3 + 3\alpha_2 + 3\alpha_1 &\ge 1, \\
\alpha_4 + 4\alpha_3 + 6\alpha_2 + 4\alpha_1 &\le 0,
\end{aligned}$$

by applying the linear form to figures with 1, 2, 3, ... points.
The general formula is then obtained by noticing the familiar
binomial coefficients, and proving by induction that

$$\sum_{i=1}^{n} \binom{n}{i} \alpha_i \quad
\begin{cases} \ge 1 & \text{if } n \text{ is odd,} \\ \le 0 & \text{if } n \text{ is even.} \end{cases}$$




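Next, by subtracting successive inequalities, we define

$$D_n = \sum_{i=1}^{n+1} \binom{n+1}{i} \alpha_i - \sum_{i=1}^{n} \binom{n}{i} \alpha_i
     = \alpha_{n+1} + \sum_{i=1}^{n} \left[ \binom{n+1}{i} - \binom{n}{i} \right] \alpha_i
     = \alpha_{n+1} + \sum_{k=0}^{n-1} \binom{n}{k} \alpha_{k+1}
     = \sum_{k=0}^{n} \binom{n}{k} \alpha_{k+1},$$

so that for all $n = 0, 1, 2, \ldots$

$$(-1)^n D_n \ge 1.$$

Using these inequalities, we will obtain a bound on the coefficients
$\{\alpha_i\}$. We will sum the inequalities with certain positive
weights; choose any $M > 0$, and consider the sum

$$\sum_{i=0}^{M} \binom{M}{i} (-1)^i D_i \;\ge\; \sum_{i=0}^{M} \binom{M}{i} = 2^M.$$

The left-hand side is

$$\begin{aligned}
\sum_{i=0}^{M} \binom{M}{i} (-1)^i D_i
 &= \sum_{i=0}^{M} \sum_{k=0}^{i} (-1)^i \binom{M}{i} \binom{i}{k} \alpha_{k+1}\\
 &= \sum_{k=0}^{M} \alpha_{k+1} \sum_{i=k}^{M} (-1)^i \left( \frac{i!}{k!\,(i-k)!} \right) \left( \frac{M!}{i!\,(M-i)!} \right)\\
 &= \sum_{k=0}^{M} \alpha_{k+1} \left( \frac{M!}{k!\,(M-k)!} \right) \sum_{i=k}^{M} (-1)^i \frac{(M-k)!}{(i-k)!\,(M-i)!}\\
 &= \sum_{k=0}^{M} \alpha_{k+1} \binom{M}{k} (-1)^k \sum_{j=0}^{M-k} (-1)^j \frac{(M-k)!}{j!\,(M-k-j)!}\\
 &= \sum_{k=0}^{M} \alpha_{k+1} \binom{M}{k} (-1)^k (1-1)^{M-k}\\
 &= \alpha_{M+1} (-1)^M,
\end{aligned}$$

so we obtain

$$|\alpha_{M+1}| \ge 2^M.$$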

Theorem 10.1: In any “reliable” realization of $\psi_{\mathrm{PARITY}}$ as a linear 
threshold function over the set of masks, the coefficients grow at 
least as fast as $2^{|S(\varphi)|-1}$. 
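
The two claims of this section can be checked by brute force on small retinas. The following is only a sketch: the helper names are ours, and the coefficient choice $\alpha_k = (-1)^{k+1} 2^{k-1}$ is an illustrative assumption, picked because it meets the bound of Theorem 10.1 with equality while remaining “reliable.”

```python
from itertools import combinations
from math import comb


def mask_sum(x_points):
    """Sum of (-2)**|S| over every nonempty mask support S contained in X."""
    total = 0
    for size in range(1, len(x_points) + 1):
        for _support in combinations(x_points, size):
            total += (-2) ** size
    return total


def parity_by_masks(x_points):
    """Threshold test from the text: the sum is < -1 exactly when |X| is odd."""
    return mask_sum(x_points) < -1


def reliable_form(n):
    """Value of sum_k C(n, k) * alpha_k on an n-point figure, using the
    assumed coefficients alpha_k = (-1)**(k + 1) * 2**(k - 1)."""
    return sum(comb(n, k) * (-1) ** (k + 1) * 2 ** (k - 1)
               for k in range(1, n + 1))


for n in range(7):
    figure = list(range(n))               # any n-point figure will do
    assert parity_by_masks(figure) == (n % 2 == 1)
    assert reliable_form(n) == n % 2      # 1 on odd figures, 0 on even ones
```

With these coefficients the linear form takes only the values 1 and 0, so no reliable realization over the masks can use coefficients smaller in magnitude than $2^{k-1}$ on masks of support size $k$, and this one attains that size.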

These values hold for the average, so if the coefficients of each 
type are not equal, some must be even larger! This shows that 
it is impractical to use masklike $\varphi$'s to recognize paritylike 
functions: even if one could afford the huge number of $\varphi$'s, one 
would have also to cope with huge ranges of their coefficients! 

Remark: This has a practically fatal effect on the corresponding learning 
machines. At least $2^{|R|}$ instances of just the maximal pattern are required 
to “learn” the largest coefficient; actually the situation is far worse 
because of the unfavorable interactions with lower-order coefficients (see 
§11.4). It follows, moreover, that the information capacity necessary to 
store the set $\{\alpha_i\}$ of coefficients is as much as would be needed to 
store the entire set of patterns recognized by $\psi_{\mathrm{PARITY}}$, that 
is, the odd subsets of $R$. For, any uniform representation of the $\alpha_i$'s 
must allow $|R| - 1$ bits for each, and since there are $2^{|R|}$ coefficients 
the total number of bits required is $|R| \cdot 2^{|R|-1}$. On the other hand 
there are $2^{|R|-1}$ odd subsets of $R$, each representable by an $|R|$-bit 
sequence, so that $|R| \cdot 2^{|R|-1}$ bits would also suffice to represent 
the subsets. And the coefficients in §10.2 would require much more storage space. 

It should also be noted that $\psi_{\mathrm{PARITY}}$ is not very exceptional in this regard 
because the positive normal-form theorem tells us that all possible $2^{2^{|R|}}$ 
Boolean functions are linear threshold functions on the set of masks. So, 
on the average, specification of a function will require $2^{|R|}$ bits of 
coefficient information, and nonuniformity of coefficient sizes would be 
expected to raise this by a substantial factor. 

10.2 Coefficients Can Grow Even Faster than Exponentially in $|R|$ 

It might be suspected that $\psi_{\mathrm{PARITY}}$ is a worst case both because 
(1) parity is a worst function and (2) masks make a worst $\Phi$. In 
fact the masks make rather a good base because coefficients over 
masks never have to be larger than $|\alpha_i| = 2^{|S(\varphi_i)|}$, as can be 
seen by expanding an arbitrary predicate into positive normal form. We 
now present a new predicate $\psi_{\mathrm{EQ}}$, together with a rather horrible 
$\Phi$, that leads to worse coefficients. Let $R$ be a set of points, 
$y_1, \ldots, y_n, z_1, \ldots, z_n$ and let $\{Y_i\}$ and $\{Z_i\}$ each be 
enumerations of the $2^n$ subsets of the $y$'s and $z$'s, respectively. Then 
any figure $X \subset R$ has a unique decomposition $X = Y_j \cup Z_k$. 

We will consider the simple predicate $\psi_{\mathrm{EQ}}$, 

$$\psi_{\mathrm{EQ}}(Y_j \cup Z_k) = \lceil j = k \rceil,$$

which simply tests, for any figure $X$, whether its $Y$ and $Z$ parts 
have the same positions in the enumeration. The straightforward 
geometric example is that in which the two halves of $R$ have the 
same form, and $Y_i$ and $Z_i$ are corresponding sets of $y$ and $z$ 
points. 

We will construct a very special set $\Phi$ of predicates for which 
$\psi_{\mathrm{EQ}} \in L(\Phi)$ and show that any realization of 
$\psi_{\mathrm{EQ}}$ in $L(\Phi)$ must involve incredibly large coefficients! 

We want to point out at the start that the $\Phi$ we will use was designed for 
exactly this purpose. In the case of $\psi_{\mathrm{PARITY}}$ we saw that coefficients can 
grow exponentially with the size of $|R|$; in that case the $\Phi$ was the set of 
masks, a natural set, whose interest exists independently of this problem. 
To show that there are even worse situations we construct a $\Phi$ with no 
other interest than that it gives bad coefficients. 

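We will define $\Phi$ to contain two types of predicates: 

$$\psi_i(Y_j \cup Z_k) = \lceil i = k \rceil,$$

$$\chi_i(Y_j \cup Z_k) = \lceil (j = k \wedge i = k) \vee (j = k - 1 \wedge i < k) \rceil,$$

each defined for $i = 1, \ldots, 2^n$. Note that $|S(\psi_i)| = n$ and 
$|S(\chi_i)| = 2n$. First we must show that $\psi_{\mathrm{EQ}} \in L(\Phi)$. 
But consider the formula 

$$\psi_{\mathrm{EQ}} = \Big\lceil \sum_i 2^i (\psi_i - \chi_i) < 1 \Big\rceil.$$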

Case I: $j = k$ 

Then $\psi_k = 1$ and $\chi_k = 1$; hence 
$\psi_{\mathrm{EQ}} = \lceil 2^k (1 - 1) < 1 \rceil$ is true. 

Case II: $j \ne k$ and $j \ne k - 1$ 

Then only $\psi_k = 1$ and $\psi_{\mathrm{EQ}} = \lceil 2^k < 1 \rceil$ is false. 

Case III: $j = k - 1$ 

Then $\psi_k = 1$ and $\chi_i = 1$ for $i = 1, \ldots, k - 1$. So 

$$\psi_{\mathrm{EQ}} = \Big\lceil 2^k - \sum_{i=1}^{k-1} 2^i < 1 \Big\rceil = \lceil 2 < 1 \rceil \text{ is FALSE},$$

and the predicate holds only for the $j = k$ case, as it should. So 
$\psi_{\mathrm{EQ}}$ is indeed in $L(\Phi)$. 
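
The three cases can also be confirmed mechanically. The sketch below is only an illustration with names of our own choosing; it identifies a figure $Y_j \cup Z_k$ with its index pair $(j, k)$ and takes $m = 2^n$ as the number of subsets in each enumeration.

```python
def psi(i, j, k):
    """psi_i(Y_j u Z_k) = [i = k]."""
    return 1 if i == k else 0


def chi(i, j, k):
    """chi_i(Y_j u Z_k) = [(j = k and i = k) or (j = k - 1 and i < k)]."""
    return 1 if (j == k and i == k) or (j == k - 1 and i < k) else 0


def psi_eq_form(j, k, m):
    """Evaluate [ sum_i 2**i * (psi_i - chi_i) < 1 ] for i = 1, ..., m."""
    total = sum(2 ** i * (psi(i, j, k) - chi(i, j, k)) for i in range(1, m + 1))
    return total < 1


M = 2 ** 3   # assumed example size: n = 3, so each enumeration has 8 subsets
mismatches = [(j, k)
              for j in range(1, M + 1)
              for k in range(1, M + 1)
              if psi_eq_form(j, k, M) != (j == k)]
```

An empty `mismatches` list means the linear form agrees with $\lceil j = k \rceil$ on every pair of indices, which is what Cases I to III assert.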


Now we establish bounds on the coefficients. Consider any expression 

$$\psi_{\mathrm{EQ}} = \Big\lceil \sum_i \alpha_i \chi_i + \sum_i \beta_i \psi_i > \theta \Big\rceil.$$

Then 

for sets $Y_{k+1} \cup Z_k$ we get $\beta_k \le \theta$, 
for sets $Y_k \cup Z_k$ we get $\alpha_k + \beta_k \ge \theta + 1$ (strong separation), 
for sets $Y_{k-1} \cup Z_k$ we get $\alpha_1 + \cdots + \alpha_{k-1} + \beta_k \le \theta$. 

We can set $\theta = 0$ by subtracting it from every $\beta$, since just one 
$\beta$ appears in each inequality. So $\beta_1 \le 0$ and $\alpha_1 \ge 1$. 
And since 

$$\alpha_k \ge 1 + \alpha_1 + \cdots + \alpha_{k-1}$$

we have immediately $\alpha_2 \ge 2$, $\alpha_3 \ge 4$, ..., $\alpha_j \ge 2^{j-1}$. 
Because the index $j$ runs from 1 to $2^n$, the highest $\alpha$ must be at 
least $2^{2^n - 1}$ times as large as the initial separation term 
$(\alpha_1 + \beta_1) - \beta_1 = \alpha_1$. 
This incredible growth rate is based in part on a mathematical 
joke: we note that an expression “$j = k$” equivalent to that for 
$\psi_{\mathrm{EQ}}$ appears already within the definitions of the $\chi_i$'s 
and it is there precisely to not-quite-fatally weaken their usefulness in 
$L(\Phi)$. 

Ironically, if we write $\psi_{\mathrm{EQ}}$ in terms of masks we have 

$$\psi_{\mathrm{EQ}} = \Big\lceil \sum_i (y_i + z_i - 2 y_i z_i) < 1 \Big\rceil,$$

and the coefficients are very small indeed! 

Problems: Find a $\Phi$ that makes the coefficients of $\psi_{\mathrm{PARITY}}$ grow like 
$2^{2^{|R|}}$. Solution in §10.3. In §10.1, $\Phi$ has $2^{|R|}$ elements and 
$\psi_{\mathrm{PARITY}}$ requires coefficients like $2^{|R|}$. In §10.2 $\Phi$ has 
$2 \cdot 2^{|R|/2}$ elements, but the coefficients are like $2^{2^{|R|/2}}$. It 
is possible to make $\Phi$'s with up to $2^{2^{|R|}}$ elements. Does this mean 
there are $\psi$'s and $\Phi$'s with coefficients like $2^{2^{2^{|R|}}}$? (We 
think not. See §10.3.) 





Can it be shown that one never needs coefficient ratios larger than 2 ^ 
for any $\Phi$? Can we make more precise the relations between coefficient 
sizes and ratios? Can it be shown that the bounds obtained by assuming 
integer coefficients give bounds on the precisions required of arbitrary 
real coefficients? Can you establish linear bounds for coefficients for the 
predicates in Chapter 7? 

The linear threshold predicate 


$$\psi_{\mathrm{EQ}} = \Big\lceil \sum_i 2^i (\psi_i - \chi_i) < 1 \Big\rceil$$


is very much like those obtained by the stratification-theorem 
method, in that at each level i the coefficient is chosen to dominate 
the worst case of summation of the coefficients of preceding 
levels. The result of the theorems of §10.1 and §10.2 is that for those 
predicates there do not exist any linear forms with smaller coefficients, 
and this suggests to us that (with respect to given $\Phi$'s) 
perhaps there is a sense in which some predicates are inherently 
stratified. We don’t have any strong ideas about this, except to 
point out that there is a serious shortage of computer-oriented 
concepts for classification of patterns. We do not know, for most 
of the cases in Chapter 7, which of them really require the strati- 
ficationlike coefficient growth: that is to say, we don’t have any 
general method to detect “inherent stratification.” 

10.3 Predicate With Possibly Maximal Coefficients 

Define $\|X\|$ to be the index of $X$ in an ordering of all the subsets of 
$R$. We will consider the simple predicate 
$\psi_{\|\mathrm{PARITY}\|} = \lceil \|X\| \text{ is odd} \rceil$ 
with respect to the following set $\Phi$ of predicates: 

$$\varphi_i(X) = \begin{cases}
0 & \text{if } \|X\| < i,\\
1 & \text{if } \|X\| = i,\\
(\|X\| - i) \bmod 2 & \text{if } \|X\| > i.
\end{cases}$$

Then $\psi_{\|\mathrm{PARITY}\|}$ is in $L(\Phi)$ and is in fact realized by 

$$\psi_{\|\mathrm{PARITY}\|} = \Big\lceil \sum_i (-1)^i f_i \varphi_i < 0 \Big\rceil,$$

where $f_i$ is the $i$th Fibonacci number ($f_n = f_{n-1} + f_{n-2}$): 
$\{f_i\} = \{1, 1, 2, 3, 5, 8, 13, \ldots\}$. 
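
As a quick numerical check of this realization, the sketch below evaluates the Fibonacci-weighted form for figures of index 1 through 20. It is only an illustration: a figure is represented by its index $\|X\|$ alone, the function names are ours, and the cutoff of 20 is an arbitrary assumption.

```python
def fib(count):
    """First `count` Fibonacci numbers, starting 1, 1, 2, 3, ..."""
    f = [1, 1]
    while len(f) < count:
        f.append(f[-1] + f[-2])
    return f[:count]


def phi(i, index):
    """phi_i evaluated on a figure whose index ||X|| equals `index`."""
    if index < i:
        return 0
    if index == i:
        return 1
    return (index - i) % 2


def linear_form(index, size):
    """sum_i (-1)**i * f_i * phi_i(X) over i = 1, ..., size."""
    f = fib(size)
    return sum((-1) ** i * f[i - 1] * phi(i, index) for i in range(1, size + 1))


SIZE = 20   # assumed number of figures, for illustration only
values = [linear_form(index, SIZE) for index in range(1, SIZE + 1)]

# The form is negative exactly on the odd-indexed figures.
assert all((v < 0) == (index % 2 == 1)
           for index, v in enumerate(values, start=1))
```

The form never strays from the two values $-1$ and $0$, which follows from the identities $f_2 + f_4 + \cdots + f_{2k} = f_{2k+1} - 1$ and $f_1 + f_3 + \cdots + f_{2k-1} = f_{2k}$.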





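Theorem 10.3: Any form in $L(\Phi)$ for $\psi_{\|\mathrm{PARITY}\|}$ must have 
coefficients at least this large; since the $f_i$ grow approximately as 

$$\frac{1}{\sqrt{5}} \left( \frac{\sqrt{5} + 1}{2} \right)^{i}$$

the largest coefficient is then of the order of magnitude of 
$\frac{1}{\sqrt{5}} \, 2^{\alpha 2^{|R|}}$ where 
$\alpha = \log_2 \left( \frac{\sqrt{5} + 1}{2} \right)$. 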

The proof of the theorem can be inferred by studying the array 
below. Each row gives $i$, the coefficient $(-1)^i f_i$, and the values 
of $\varphi_i(X_j)$ for $\|X_j\| = 1, 2, 3, \ldots$: 

                        ||X_j||
     i   coeff.    1  2  3  4  5  6  7  8  9 ...

     1     -1      1  1  0  1  0  1  0  1  0 ...
     2     +1      0  1  1  0  1  0  1  0  1 ...
     3     -2      0  0  1  1  0  1  0  1  0 ...
     4     +3      0  0  0  1  1  0  1  0  1 ...
     5     -5      0  0  0  0  1  1  0  1  0 ...
     6     +8      0  0  0  0  0  1  1  0  1 ...
     7    -13      0  0  0  0  0  0  1  1  0 ...



It can be seen that if $\alpha_1 < 0$ and the coefficients are integers then 

$$\alpha_{2i+1} < - \sum_{j=1}^{i} \alpha_{2j}, \qquad
  \alpha_{2i} \ge - \sum_{j=1}^{i} \alpha_{2j-1},$$

and the reader can verify that this implies that for all $\alpha_i$, 

$$|\alpha_{i+1}| \ge |\alpha_i| + |\alpha_{i-1}|;$$

hence $|\alpha_i| \ge f_i$. 

Discussion and conjecture: This predicate and its $\Phi$ have the same 
quality as that in §10.2: the $\varphi$'s themselves are each almost 
the desired predicate. Note also that by properly ordering the subsets, 
we can arrange that 

$$\psi_{\|\mathrm{PARITY}\|} = \psi_{\mathrm{PARITY}}.$$

We conjecture that this example is a worst case: to be precise, if 
$\Phi$ contains $|\Phi|$ elements, the maximal coefficient growth cannot 
be faster than 

$$\left( \frac{1 + \sqrt{5}}{2} \right)^{|\Phi|}$$

where the exponent constant is the Fibonacci, or golden-rectangle, 
ratio. Our conjecture is based only on arguments too flimsy to 
write down.* 

10.4 The Group-Invariance Theorem and Bounded Coefficients on 
the Infinite Plane 

In §7.10 we noted a counterexample to extending the group- 
invariance theorem (§2.3) to infinite retinas. The difficulty came 
through using an infinite stratification that leads to unbounded 
coefficients. This in turn raises convergence problems for the 
symmetric summations used to prove equal the coefficients within 
an equivalence class. If the coefficients are bounded, and the 
group contains the translation group, we can prove the cor- 
responding theorem. (We do not know stronger results: presum- 
ably there is a better theorem with a summability-type condition 
on the coefficients and a structure condition on the group.) The 
proof uses the geometric fact that for increasing concentric circles 
about two fixed centers the proportion of area in common ap- 
proaches unity. 



*Such as the fact that $\sqrt{5}$ occurs in upper bounds in the theories of rational 
approximations and geometry of numbers. 





10.4.1 Bounded Coefficients and Group Invariance 

Let $\psi$ be a predicate invariant under translation of the infinite 
plane. 

Theorem 10.4.1: If the coefficients of the $\varphi$'s are bounded in each 
equivalence class, then there exists an equivalent perceptron with 
coefficients equal in each equivalence class. 

Proof: Let $T_C$ be the set of translations with displacements less 
than some distance $C$. Let $\psi = \lceil \sum \alpha(\varphi) \varphi > \theta \rceil$. 
Now define 

$$\begin{aligned}
\psi_C(X) &= \Big\lceil \sum_{g \in T_C} \Big( \sum_{\varphi} \alpha(\varphi) \varphi(gX) - \theta \Big) > 0 \Big\rceil\\
 &= \Big\lceil \sum_{\varphi} \varphi(X) \sum_{g \in T_C} \alpha(\varphi g^{-1}) > \sum_{T_C} \theta \Big\rceil\\
 &= \Big\lceil \sum_{\varphi} \varphi(X) \sum_{g \in T_C} \alpha(\varphi g) > \sum_{T_C} \theta \Big\rceil,
\end{aligned}$$

because $T_C$ is carried onto itself under the group inverse. By the 
argument of §2.3 each $\psi_C$ is equivalent to $\psi$ as a predicate. The 
following lemma shows that we can select an increasing sequence 
$R_1, R_2, \ldots$ of radii for which the limit 

$$\lim_{i \to \infty} \sum_{g \in T_{R_i}} \alpha(\varphi g)$$

has the same value independent of $\varphi$ within every equivalence 
class. 

Lemma: Suppose some function $f(x)$ is bounded, that is, 
$|f(x)| < M$, in $E^2$. Then there exists a sequence of increasing 
radii $R_i$ such that for any system of concentric circles with these 
radii, the value of 

$$\lim_{i \to \infty} \frac{1}{\pi R_i^2} \iint_{|y - p| < R_i} f(y) \, dA$$

will be the same, independent of the selection of the common 
center $p$. 





PROOF: Let $p$ be the origin. Since all values of the averaged integral 
lie in the interval $[-M, M]$, any infinite sequence of them must have 
an infinite convergent subsequence. Choose such a convergent 
subsequence from the circles with radii 1, 2, 3, etc. Let 
$R_1, R_2, \ldots$ be such a sequence. Then for each $i$ we have 

$$\left| \frac{1}{\pi R_i^2} \iint_{|y| < R_i} f(y) \, dA \right| < M.$$

Given any other center $p$ for the circles, note that 

$$\left| \iint_{|y| < R_i} f(y) \, dA - \iint_{|y| < R_i} f(p + y) \, dA \right| < 2 \cdot M \cdot A_i(p),$$

where $A_i(p)$ is the area of nonoverlap between the two disks 
$|y| < R_i$ and $|y - p| < R_i$. But as the radius grows, for any $p$ 

$$\lim_{i \to \infty} \frac{A_i(p)}{\pi R_i^2} = 0$$

so the two sequences approach the same limit. 


Q.E.D. 


To prove the main theorem, we simply choose a representative $\varphi$ 
from some equivalence class, and set $f(g) = \alpha(\varphi g)$, regarding $g$ as 
a translation from the origin. 

It follows that the perceptron obtained in §7.4 must have unbounded 
coefficients, and there is no equivalent representation in 
$L(\Phi)$ with bounded coefficients. 


Note: The methods of §10.2 and §10.3 are similar to those used by 
Myhill and Kautz [1961] to find maximal coefficients for the 
order- 1 case. They show that with integer coefficients there is an 
order- 1 predicate for which some coefficient must exceed 2/e- 
1 /n • 2 n . 



Learning 

11 


11.0 Introduction 

In previous chapters we used no systematic technique to find a 
representation of a predicate as a member of an $L(\Phi)$. Instead, we 
always constructed coefficients by specific mathematical analysis 
of the predicate and the set $\Phi$. These analyses were ad hoc to each 
predicate. In this chapter we study situations in which sets of co- 
efficients can be found by a more systematic and easily mechanized 
procedure. It is the possibility of doing this that has given the 
perceptron its reputation as a “learning machine.” 

The conceptual scheme for “learning” in this context is a machine 
with an input channel for figures, a pair of yes and no output 
indicators, and a reinforcement or “reward” button that the 
machine’s operator can use to indicate his approval or disap- 
proval of the machine’s behavior (see Figure 11.1). The operator 




has two stacks $F^+$ and $F^-$ of figures and he would like the machine 
to respond yes to all figures in $F^+$ and no to all figures in 
$F^-$. He indicates approval by, say, pressing the button if the 
reaction is correct. The machine is to modify its internal state better 
to conform to its master’s wishes. 

There are many ways to build such a machine. The most obvious scheme 
is to have some kind of recording device to store incoming figures in two 
separate files, for $F^+$ and $F^-$. This kind of machine will never make a 
mistake on a previously seen figure but, along with its never-forgetting, 
it brings other elephantine characteristics. Another, very different kind of 
machine would attempt to find descriptive characteristics that distinguish 
between the figures of the two classes, and to use new figures to sharpen 
and elaborate these descriptions. This kind of machine would, in the long 
run, require less memory but its mechanism and its theory are both much 
more complicated. If the classes $F^+$ and $F^-$ are very large then the first 
machine is disbarred; if there is no description within the practical 
repertoire of the second machine, it will fail. 

The perceptron, as a pattern-discriminating machine, lies between these 
two paradigms. It is not a pure memory-matching machine, for it does 
not store the pictures. As a description-machine its repertoire is limited 
(as we have seen in the previous chapters) to what can be done with 
“local” features of the patterns and only linear threshold relations 
between these features. The existence of the simple learning procedures 
described below results from this restriction on the machine’s descriptive 
power (and could be regarded as a partial compensation for this limita- 
tion). 

Let us suppose that the machine contains a perceptron with a 
fixed $\Phi$ and adjustable coefficients. When a figure $X$ is presented 
the sum 

$$\sum_{\varphi \in \Phi} \alpha_\varphi \varphi(X)$$

is computed. If $X$ belongs to $F^+$ and this sum is positive, the 
machine responds yes and all is well. If $X$ belongs to $F^+$ but the 
sum is negative, the machine responds no. This is bad, and something 
must be done. What is the simplest possible correction 
procedure? 

The first idea that comes to mind, especially to people who have 
grown up on the idea of feedback, is the following: Since the 
sum was too small, let’s increase its coefficients. If it had come out 
too large (namely, response yes for a figure in $F^-$), we would 
decrease coefficients. 

But we must adjust the coefficients in a reasonable manner, so 
that the feedback effect is directed properly. 

Suppose that $\sum \alpha_\varphi \varphi(X)$ comes out negative for an $X$ in $F^+$. In 
general some $\varphi$'s give zero values for $\varphi(X)$, and their coefficients 
clearly cannot be blamed for the bad total. In fact, changing these 
coefficients might do harm in relation to other $X$'s and does no 
good in relation to the current $X$. Thus we should increase $\alpha_\varphi$ only 
if $\varphi(X) = 1$. We should like a procedure for doing this whose 
mathematical form is clear enough to allow simple analysis and 
whose power is great enough to yield a reasonable success. The 
procedure given in §11.1 achieves both these goals, but we will 
first make a few introductory remarks. 





11.0.1 Coefficients and Vectors 

It is convenient to think of the set of coefficients $\{\alpha_\varphi\}$, ordered 
in an arbitrary but fixed way, as a vector in $|\Phi|$-dimensional 
space. Denote this vector by $A$. Similarly the set $\{\varphi(X)\}$, ordered 
in the same way, can be taken as a vector whose components are 
the values of the $\varphi(X)$'s. We denote this vector by $\Phi(X)$. Now the 
operation of increasing those coefficients that correspond to nonzero 
values of $\varphi(X)$ is neatly performed by merely adding the 
vector $\Phi(X)$ to the vector $A$. If the sum had come out positive 
for $X$ in $F^-$, we would subtract $\Phi(X)$ from $A$. 
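
The vector view can be made concrete in a few lines. This is a minimal sketch under assumptions of our own: a three-point retina, three hypothetical masks, and figures represented as Python sets of points; none of these choices come from the text.

```python
def feature_vector(phis, figure):
    """Phi(X): the values phi(X), listed in a fixed order."""
    return [phi(figure) for phi in phis]


def add_scaled(a, v, sign=1):
    """Return A + sign * V, component by component."""
    return [ai + sign * vi for ai, vi in zip(a, v)]


# Hypothetical masks on the retina {0, 1, 2}; each is 1 when its support
# is contained in the figure and 0 otherwise.
phis = [
    lambda x: 1 if {0} <= x else 0,
    lambda x: 1 if {1} <= x else 0,
    lambda x: 1 if {0, 2} <= x else 0,
]

A = [0, 0, 0]
X = {0, 2}

# If X is in F+ and the sum was too small: add Phi(X) to A.
A_after_add = add_scaled(A, feature_vector(phis, X))
# If X is in F- and the sum was too large: subtract Phi(X) from A.
A_after_subtract = add_scaled(A, feature_vector(phis, X), sign=-1)
```

Only the coefficients of the masks that respond to `X` (the first and third here) are changed, which is exactly the restriction argued for above.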

A priori , any procedure of this sort runs the risk of oscillating 
wildly. An adjustment of the coefficients in the appropriate direc- 
tion for one figure might undo the previous adjustment for an- 
other figure. Thus our intuition about whether it will work is in- 
fluenced by two conflicting ideas drawn from experience with 
cybernetic situations: simple error-correcting feedback does often 
work; on the other hand, the process involves a search in a 
$|\Phi|$-dimensional space and our experience with other schemes for 
“hill-climbing” makes us acutely aware of the dangers that beset 
such procedures. Close analysis is needed. 

This question of whether simple feedback will work can be posed in other 
words closely related to our main theme. The condition to be satisfied 
by the set of coefficients $\{\alpha_\varphi\}$ is defined globally in relation to the entire 
set of figures. On the other hand the “correction” procedure is highly 
local in the sense that each change made to the current values of these 
coefficients is based on consideration of just one figure. Thus the 
problem of finding conditions under which the procedure will make the $\alpha_\varphi$ 
converge to globally satisfactory values belongs to the study of the rela- 
tion between apparently global and apparently local computations. 

In this chapter we will show that very small refinements will turn 
the simple feedback principle into a workable “training” or error- 
correction procedure. The main theorems about this are already 
fairly well known. Our main concern will be to understand why it 
works. By analyzing it from several points of view, its mechanism 
will become transparent and its logic obvious. 

In our discussion of recognizability of figures we have tried to 
replace vague formulations of questions about whether percep- 
trons are “good” or “bad” recognizers by an analytic theory that 
shows why perceptrons succeed in some cases and must fail in 





others. Although we do not have an equally elaborated theory of 
“learning,” we can at least demonstrate that in cases where 
“learning” or “adaptation” or “self-organization” does occur, its 
occurrence can be thoroughly elucidated and carries no sugges- 
tion of mysterious little-understood principles of complex sys- 
tems. Whether there are such principles we cannot know. But the 
perceptron provides no evidence; and our success in analyzing it 
adds another piece of circumstantial evidence for the thesis that 
cybernetic processes that work can be understood, and those 
that cannot be understood are suspect. 

11.1 The Perceptron Convergence Theorem 

Consider the following program in which the vector notation 
$A \cdot \Phi(X)$ is used in place of our usual 
“$\sum \alpha_\varphi \varphi(X)$” notation. 

start: Choose any value for $A$. 

test: Choose an $X$ from $F^+ \cup F^-$. 

If $X \in F^+$ and $A \cdot \Phi(X) > 0$ go to test. 

If $X \in F^+$ and $A \cdot \Phi(X) \le 0$ go to add. 

If $X \in F^-$ and $A \cdot \Phi(X) < 0$ go to test. 

If $X \in F^-$ and $A \cdot \Phi(X) \ge 0$ go to subtract. 

add: Replace $A$ by $A + \Phi(X)$. 

Go to test. 

subtract: Replace $A$ by $A - \Phi(X)$. 

Go to test. 
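
The program above can be sketched in Python as follows. Several details are assumptions of ours rather than part of the program: the choice function in test simply cycles through $F^+$ and then $F^-$, the loop stops after one full cycle without a change, and the example data (a constant feature plus two inputs, with $F^+$ containing only the figure in which both inputs are on) is hypothetical.

```python
def dot(a, v):
    """Inner product A . Phi."""
    return sum(ai * vi for ai, vi in zip(a, v))


def train_two_class(f_plus, f_minus, a, max_passes=1000):
    """Run START / TEST / ADD / SUBTRACT with a cyclic choice function.

    Returns the final vector A and the number of times A was changed.
    """
    a = list(a)                                   # START
    changes = 0
    for _ in range(max_passes):
        changed = False
        for v in f_plus:                          # TEST on members of F+
            if dot(a, v) <= 0:                    # ADD
                a = [ai + vi for ai, vi in zip(a, v)]
                changes += 1
                changed = True
        for v in f_minus:                         # TEST on members of F-
            if dot(a, v) >= 0:                    # SUBTRACT
                a = [ai - vi for ai, vi in zip(a, v)]
                changes += 1
                changed = True
        if not changed:                           # every figure passed TEST
            return a, changes
    raise RuntimeError("no separating vector found within max_passes")


# Hypothetical feature vectors Phi(X) = (1, x1, x2).
F_PLUS = [(1, 1, 1)]
F_MINUS = [(1, 0, 0), (1, 1, 0), (1, 0, 1)]

A_final, n_changes = train_two_class(F_PLUS, F_MINUS, [0, 0, 0])
```

These two sets are separable (for instance by the vector $(-3, 2, 2)$), so the loop ends with a vector that gives every member of `F_PLUS` a positive sum and every member of `F_MINUS` a negative one; the vector found need not be $(-3, 2, 2)$ itself.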

We assume until further notice that there exists a vector, $A^*$, with 
the property that if $X \in F^+$ then $A^* \cdot \Phi(X) > 0$ and if $X \in F^-$ 
then $A^* \cdot \Phi(X) < 0$. The perceptron convergence theorem then 
states that whatever choice is made in start and whatever choice 
function is used in test, the vector $A$ will be changed only a finite 
number of times. In other words, $A$ will eventually assume a value 
$A^0$ for which $A^0 \cdot \Phi(X)$ always has the proper sign, that is, the 
predicate 

$$\psi(X) = \lceil A^0 \cdot \Phi(X) > 0 \rceil$$

will have the property: 

$X \in F^+$ implies $\psi(X) = 1$, 

$X \in F^-$ implies $\psi(X) = 0$. 





This is often expressed by saying that the predicate $\psi(X)$ separates 
the sets $F^+$ and $F^-$. The convergence theorem can be loosely 
stated as: if the sets are separable (that is, if there exists a 
“solution” vector $A^*$), then the program will separate them (that is, 
it will find a solution vector $A^0$ which may or may not be the 
same as $A^*$). 

Because we are now concerned more with the sets of coefficients 
$\{\alpha_\varphi\}$ than with the nature of $\Phi$ itself or the geometry of figures 
in $R$, it will be convenient to think of the functions in $L(\Phi)$ as 
associated with the sets $\{\alpha_\varphi\}$ regarded as vectors whose base 
vectors are the $\varphi$'s in $\Phi$. Warning: the vector-space base is the 
set of $\varphi$'s, and not the points of $R$! Although in this chapter we 
will think of the forms $\sum \alpha_\varphi \varphi$ as elements of a vector space, one 
should remember that the set $L(\Phi)$ of $\psi$'s isn’t a vector space, and 
that each $\psi \in L(\Phi)$ can be represented by many $A$ vectors.† 

In this vector-space context, the classes $F^+$ and $F^-$ of figures are 
mapped into classes of vectors, which we will still call $F^+$ and $F^-$. 
The mapping from pictures to vectors may, of course, be degenerate, 
for we could have two figures $X \ne X'$ for which 
$\Phi(X) = \Phi(X')$: the original figures are “seen” only through the 
$\varphi$'s, and some details can be lost. 

We will now discard the restriction on the $\varphi$'s that their values 
be either 0 or 1. The $\varphi$-functions may now take on any real, 
positive or negative values and, for different $X$'s, each $\varphi$ may 

†It may be observed that vector geometry occurs only here and in Chapter 12 of 
this book. In the general perceptron literature, vector geometry is the chief 
mathematical tool, followed closely by statistics, which also plays a small role 
in our development. If we were to volunteer one chief reason why so little was 
learned about perceptrons in the decade that they have been studied, we would 
point toward the use of this vector geometry! For in thinking about the 
$\sum \alpha_i \varphi_i$'s as inner products, the relations between the patterns $\{X\}$ and the 
predicates in $L(\Phi)$ have become very obscure. The $A$-vectors are not linear 
operators on the patterns themselves; they are “co-operators,” that is, they 
operate on spaces of functional operators on the patterns. Since the bases 
(the $\Phi$-classes) of their vector spaces are arbitrary, one can’t hope to use them to 
discover much about the kinds of predicates that will lie in an $L(\Phi)$. The 
important questions aren’t about the linear properties of the $L(\Phi)$'s, but about 
the orders of complexities in computing pattern qualities from the information in 
the $\{\varphi(X)\}$ set itself. 





have any number of different values. So we can think of $F^+$ and 
$F^-$ as two arbitrary clouds of points in $\Phi$-space. 


The main danger in allowing this generality is that the feedback procedure 
might be overwhelmed by vectors too large or stalled by vectors 
too small, so instead of adding or subtracting $\Phi$ itself, we will later 
use instead the unit-length vector in the same direction: 

$$\hat{\Phi} = \frac{\Phi}{|\Phi|}, \quad \text{so that } |\hat{\Phi}| = 1.$$

If the sets $F^+$ and $F^-$ are infinite the angles between pairs of vectors, 
one from each set, can have zero as a limit. In that case there is only one 
solution vector, and the program may not find it. The conditions of 
Theorem 11.1 will exclude this possibility. 


The case-analysis in test of the program just described is 
overcomplicated. The following program has the identical behavior: 

start: Choose any value for $A$. 

test: Choose a $\Phi$ from $F^+ \cup F^-$. 

If $\Phi \in F^+$ and $A \cdot \Phi > 0$ go to test. 

If $\Phi \in F^+$ and $A \cdot \Phi \le 0$ go to add. 

Replace $\Phi$ by $-\Phi$. 

If $\Phi \in F^-$ and $A \cdot \Phi > 0$ go to test. 

If $\Phi \in F^-$ and $A \cdot \Phi \le 0$ go to add. 

add: Replace $A$ by $A + \Phi$. 

Go to test. 


This is equivalent because (1) we have reversed the inequality 
signs in the part of test following the change of $\Phi$, so all decisions 
will go the same way; (2) the effect of “go to add” is the same 
as “go to subtract” with reversed sign of $\Phi$. Now, “replace $\Phi$ 
by $-\Phi$” is executed if and only if $\Phi \in F^-$, and since the 
inequality conditionals now have identical outcomes we can replace 
the program by the still equivalent program: 





start: Choose any value for $A$. 

test: Choose a $\Phi$ from $F^+ \cup F^-$; 

If $\Phi \in F^-$ change the sign of $\Phi$. 

If $A \cdot \Phi > 0$ go to test; 
otherwise go to add. 

add: Replace $A$ by $A + \Phi$. 

Go to test. 

In other words the problem of finding a vector A to separate two 
given sets F + and F~ is not really different from the problem of 
finding a vector A that satisfies 

$$\Phi \in F \;\Rightarrow\; A \cdot \Phi > 0$$

for a single given set F, defined as F + together with the negatives 
of the vectors of F~. 

We use these observations to simplify the program and statement 
of the convergence theorem : for simplicity we will state a version 
that uses unit vectors. 

Theorem 11.1: Perceptron Convergence Theorem: Let $F$ be a set 
of unit-length vectors. If there exists a unit vector $A^*$ and a number 
$\delta > 0$ such that $A^* \cdot \Phi > \delta$ for all $\Phi$ in $F$, then the program 

start: Set $A$ to an arbitrary $\Phi$ of $F$. 

test: Choose an arbitrary $\Phi$ of $F$, and 

if $A \cdot \Phi > 0$ go to test; 
otherwise go to add. 

add: Replace $A$ by $A + \Phi$. 

Go to test. 

will go to add only a finite number of times. 
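
A small experiment illustrates the theorem. The sketch below rests on our own assumptions: a hypothetical four-vector set $F$ (built as $F^+$ together with the negatives of $F^-$, then scaled to unit length), an assumed solution vector $A^*$, a cyclic choice function in test, and a stopping rule of one full cycle without a pass through add.

```python
import math


def dot(a, v):
    """Inner product of two vectors."""
    return sum(ai * vi for ai, vi in zip(a, v))


def unit(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def run_program(F, max_passes=10_000):
    """START / TEST / ADD from Theorem 11.1, with TEST cycling through F."""
    a = list(F[0])                      # START: A is some member of F
    adds = 0
    for _ in range(max_passes):
        changed = False
        for phi in F:                   # TEST
            if dot(a, phi) <= 0:        # "otherwise go to ADD"
                a = [ai + pi for ai, pi in zip(a, phi)]
                adds += 1
                changed = True
        if not changed:                 # a full cycle without ADD
            return a, adds
    raise RuntimeError("max_passes exhausted")


# Hypothetical data: F+ = {(1,1,1)} and F- = {(1,0,0), (1,1,0), (1,0,1)};
# members of F- are negated before normalizing.
F = [unit(v) for v in [(1, 1, 1), (-1, 0, 0), (-1, -1, 0), (-1, 0, -1)]]
A_STAR = unit((-3, 2, 2))               # an assumed solution vector
delta = 0.99 * min(dot(A_STAR, phi) for phi in F)   # so that A* . Phi > delta

A, adds = run_program(F)
```

For this data the number of passes through add stays below $1/\delta^2$, and the final `A` satisfies $A \cdot \Phi > 0$ for every $\Phi$ in `F`. The factor 0.99 is there only because the theorem asks for a strict inequality $A^* \cdot \Phi > \delta$.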

Some readers might be amused to note that the proof of this theorem 
does not use any assumptions of finiteness of the set F or the dimension 
of the vector space. This will not be true of later sections where the 
compactness of the unit sphere plays an apparently essential role. 





Corollary: We will generally assume that the program is presented 
a sequence such that each $\Phi \in F$ repeats indefinitely often. 
Then it follows that it will eventually find a “solution” vector $A$, 
that is, one for which 

$$A \cdot \Phi > 0 \text{ for all } \Phi \in F.$$

This will not, of course, necessarily be $A^*$, because $A^*$ is an 
arbitrary solution vector. All solution vectors form a “convex 
cone,” and the program will stop changing $A$ as soon as it penetrates 
the boundary of this cone. [Convex cone: a set $S$ of vectors 
for which (1) $a \in S \Rightarrow ka \in S$ for all $k > 0$, (2) $a \in S$ and 
$b \in S \Rightarrow (a + b) \in S$. It is not a vector subspace because of the 
$k > 0$ condition.] 

11.2 Proof of the Convergence Theorem 

11.2.1 

Define 

$$G(A) = \frac{A^* \cdot A}{|A|}.$$

It may help some readers to notice that $G(A)$ is the cosine of the 
angle between $A$ and $A^*$. Because $|A^*| = 1$, we have 

$$G(A) \le 1.$$

Consider the behavior of $G(A)$ on successive passes of the program 
through add. 

$$\begin{aligned}
A^* \cdot A_{t+1} &= A^* \cdot (A_t + \Phi)\\
 &= A^* \cdot A_t + A^* \cdot \Phi\\
 &> A^* \cdot A_t + \delta;
\end{aligned}$$

hence after the $n$th application of add we obtain 

$$A^* \cdot A_n > n\delta. \qquad \text{(THESIS)}$$

Thus the numerator of $G(A)$ increases linearly with the number 
of changes of $A$, that is, the number of errors. 





As for the denominator, since $A_t \cdot \Phi$ must be negative (or the 
program would not have gone through add) 

$$\begin{aligned}
|A_{t+1}|^2 &= A_{t+1} \cdot A_{t+1}\\
 &= (A_t + \Phi) \cdot (A_t + \Phi)\\
 &= |A_t|^2 + 2 A_t \cdot \Phi + |\Phi|^2\\
 &< |A_t|^2 + 1,
\end{aligned}$$

and after the $n$th application of add, 

$$|A_n|^2 < n. \qquad \text{(ANTITHESIS)}$$

Combining the results thesis and antithesis, we obtain 

$$G(A_n) = \frac{A^* \cdot A_n}{|A_n|} > \frac{n\delta}{\sqrt{n}} = \sqrt{n}\,\delta.$$

But $G(A) \le 1$, so this can continue only so long as $\sqrt{n}\,\delta < 1$, 
that is, 

$$n < 1/\delta^2.$$

This completes the proof. 



Figure 11.2 The radial increase must be at least $\delta$, yet the new 
vector must remain in the shaded region; this becomes impossible when 
the region, whose thickness varies inversely with $|A|$, becomes thinner 
than $\delta$. 




Figures 11.2 and 11.3 show some aspects of the geometry of the 
rate of growth of $|A|$. They are particularly interesting if one 
wishes to look at the algebraic proof in the following dialectical 
and slightly perverse form. Inequality antithesis can be read as 
saying that $|A_n|$ increases more slowly than the square root of $n$. 
On the other hand Inequality thesis can be turned (via the 
Cauchy-Schwarz inequality) into an assertion that $|A_n|$ grows 
linearly with $n$. This leads to a contradiction: $|A_n|$ must grow, 
but cannot grow fast enough. 

Figure 11.3 The extreme case in which the bound $|A_n| = \sqrt{n}$ is 
obtained. 

11.3 A Geometric Proof (Optional) 

We are given a (unit) vector $A^*$ with the property 

$$A^* \cdot \Phi > \delta \text{ for all } \Phi \in F.$$

This means that every vector $\Phi$ in $F$ makes an angle $\theta_\Phi$ with $A^*$ 
for which $\cos \theta_\Phi > \delta$. If we choose $\theta^* > 0$ to be smaller than 
$\pi/2 - \theta_\Phi$ for every $\Phi$ in $F$, then every vector $V$ within 
$\theta^*$ of $A^*$ has the property 

$$V \cdot \Phi > 0 \text{ for all } \Phi \in F.$$

Therefore any vector $V$ within the circular cone with base angle $\theta^*$ 
from $A^*$ will be a solution vector that will cause the program to stop 
changing. 

Now consider the vector A computed within the program. At 
each stage A is a sum of members of F. Thus 

$$A^* \cdot A = A^* \cdot (\Phi_1 + \Phi_2 + \cdots) > 0.$$

Let this page represent the plane containing A* and A. If we take 
A* as a unit vector oriented vertically, the above inequality shows 
that A must be oriented into the upper half-plane: 



We should like to show that each time the program passes 
through add, A is brought closer in direction to A*. Unfortu- 
nately, this is not strictly true. Figure 11.4 shows, however, that 
it will “normally” happen; and we should understand this normal 
case before closing off the details to obtain a rigorous proof. 

When add is used a vector $\Phi$ will be added to the current value
of A, say $A_t$, to obtain a new value of A, say $A_{t+1} = A_t + \Phi$. We
know two facts about $\Phi$:

$A^* \cdot \Phi > 0,$

$A_t \cdot \Phi \le 0.$





Now consider the projection $\tilde\Phi$ of $\Phi$ on the plane of the paper
and placed with its origin at the end of $A_t$ (in preparation for the
usual geometric picture of vector addition). The first condition
states that the end of $\tilde\Phi$ must be above the dotted line and the
second condition states that it must be below the dashed line.
Thus, it lies as shown and points from the end of $A_t$ towards
the direction of A*.



If we consider the right cone generated by rotating $A_t$ about A*, it
is clear that $\Phi$ itself (of which $\tilde\Phi$ is the projection) runs into
the cone. The proof of the theorem would be complete except for
the observation that $\Phi$ might leave the cone again and so allow
$A_{t+1}$ to have a larger angular separation from A* than did $A_t$.
Figure 11.5 shows how this might happen.



Figure 11.5





But the overshoot phenomenon is not fatal, for it can occur only a
limited number of times, depending on $\theta^*$. To prove this consider
the cone generated by rotating A about A*. Because $\Phi$ always has
a vertical component $\Phi \cdot A^* > \delta$, the height of the cone increases
each time A is changed. If the angle between A and A* remains
greater than $\theta^*$ (and if not, the proof is finished!), the rim of the
cone will come to have indefinitely large radii. Now let us look
down, along A*, at the projection $\tilde\Phi$ of $\Phi$ on the top of the cone:
we will show that the end of $\tilde\Phi$ must lie at least a distance d
toward A* from the tangent line. Also, since $|\Phi| = 1$, the end of
$\tilde\Phi$ must lie inside a unit circle drawn around the end of A (see
Figure 11.6).



Thus the end of $\tilde\Phi$ must lie in the shaded region. When the cone
rim gets large enough, the shaded region will lie entirely within
the cone, and so will $\tilde\Phi$, and therefore also the end of $\Phi$ which is
directly above it. So it remains only to show where the magic
distance d comes from.




To see this, we now look along the line tangent to the cone-rim
through A (see Figure 11.7). Now the end of $\Phi$ must lie within
the shaded region defined by (1) the plane orthogonal to A, and
(2) a plane orthogonal to A* and lying $\delta$ above A, again because
$A^* \cdot \Phi > \delta$. Thus, the end of $\tilde\Phi$ cannot come closer than the
indicated distance d to the tangent. Because A is a sum of $\Phi$’s it
cannot ever make an angle greater than $\pi/2 - \theta^*$ with A*, and this
gives a positive lower bound to d. So, after a finite “transient”
period, the A’s remain in a “vertical” cylinder, which must
eventually go inside the acceptance cone around A*.

Figure 11.7

This proves that, eventually, A must stop changing. Thus 
Theorem 11.1 is proved. 




11.4 Variants of the Convergence Theorem 

The convergence theorem has a large number of minor variants. 

It is easy to adapt our proof to cover any of the following forms 
in which it occurs in the literature on perceptrons: 

(1) Instead of assuming F to consist of unit vectors one can
assume it to be a finite set, or to satisfy an upper and lower bound
on length, that is, there exist a, b such that $0 < a \le |\Phi| \le b$ for
all $\Phi \in F$.

(2) Instead of replacing A by $A + \Phi$ one can replace it by
$A + k\Phi$, where k is a real number chosen by any one of the rules:

k is a constant > 0.

$k = 1/|\Phi|$, that is, add a unit vector.

$k = c\,\frac{|A \cdot \Phi|}{|\Phi|^2}$: if c = 1 then k is just enough to bring
$(A + k\Phi) \cdot \Phi$ out of the negative range. Or, one can use any
value for c between 0 and 2. [Agmon, 1954]

These and similar modifications do not change the theorem in the
sense that A will still come, after a finite number of transfers,
close to a solution vector; the actual number of transfers will,
however, be altered. It would also be interesting to compare the
relative efficiency of the “local” perceptron convergence program
with more “global” analytic methods, for example, linear
programming, for solving the system of inequalities in A:

$A \cdot \Phi > 0$, all $\Phi$ in F.


11.4.1 More Than Two Classes 

A more substantial variation is obtained by allowing more than
two classes of input figures. Let $F_1, F_2, \ldots$ be sets of figures and
suppose that there are vectors $A_i^*$ and $\delta > 0$ such that

$\Phi \in F_i$ implies that for all $j \ne i$, $A_i^* \cdot \Phi > A_j^* \cdot \Phi + \delta$.

The perceptron convergence theorem generalized to this case
assures us that vectors with the same property can be found by
following the usual principle of feedback: whenever one runs into
a figure $\Phi$ in $F_i$ for which $A_i \cdot \Phi < A_j \cdot \Phi$ for some j, $A_i$ must
be “increased” and $A_j$ “decreased.”





This idea is expressed more precisely in the program: 


start:   Choose any nonzero values for $A_1, A_2, \ldots$

test:    Choose i, j, and $\Phi \in F_i$.
         If $A_i \cdot \Phi > A_j \cdot \Phi$ go to test;
         otherwise go to change.

change:  Replace $A_i$ by $A_i + \Phi$.
         Replace $A_j$ by $A_j - \Phi$.
         Go to test.


The generalized theorem states that the program will go to
change only a finite number of times. But this is possible only if
the machine eventually stops making mistakes, that is, eventually
every $\Phi$ in $F_i$ will make

$A_i \cdot \Phi > A_j \cdot \Phi$, for all $j \ne i$.

To prove this, let $A_1^*, \ldots, A_i^*, \ldots, A_j^*, \ldots, A_m^*$ have the required
property, and define A* to be the vector (in a larger space) defined
by stringing together all their coefficients. Also, for each $\Phi \in F_i$
define $\Phi_{ij}$ to be the vector that contains $\Phi$ and $-\Phi$ in the ith and
jth blocks, with zeros elsewhere. Apply Theorem 11.1 to this large
space.
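The start/test/change program above can be sketched in Python as
follows. This is an illustration under stated assumptions, not the
authors' code: the all-ones starting vectors, the three small classes,
and the name `train_multiclass` are hypothetical, and the figures are
presented in a fixed cyclic order rather than chosen arbitrarily.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_multiclass(classes, max_rounds=1000):
    """classes[i] lists the vectors of F_i; returns (vectors A_i, changes)."""
    dim = len(classes[0][0])
    A = [[1.0] * dim for _ in classes]        # start: any nonzero values
    changes = 0
    for _ in range(max_rounds):
        changed = False
        for i, F_i in enumerate(classes):     # test: choose i, j, and phi
            for phi in F_i:
                for j in range(len(classes)):
                    if j != i and dot(A[i], phi) <= dot(A[j], phi):
                        # change: "increase" A_i and "decrease" A_j
                        A[i] = [a + p for a, p in zip(A[i], phi)]
                        A[j] = [a - p for a, p in zip(A[j], phi)]
                        changes += 1
                        changed = True
        if not changed:
            return A, changes
    raise RuntimeError("no solution found within max_rounds")

# Assumed example: three classes that unit coordinate vectors separate.
classes = [
    [[1.0, 0.1, 0.2], [0.9, 0.2, 0.0]],
    [[0.1, 1.0, 0.1], [0.2, 0.8, 0.1]],
    [[0.0, 0.2, 1.0], [0.1, 0.1, 0.9]],
]
A, changes = train_multiclass(classes)
print(changes, "passes through change")
```

Each pass through change adds $\Phi_{ij}$ to the concatenated vector
$(A_1, \ldots, A_m)$, which is how the reduction to Theorem 11.1 reads
in code.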

11.5 Application: Learning the Parity Predicate $\psi_{PARITY}$

As an example to illustrate the convergence theorem, we shall
estimate the number of steps required to learn the coefficients for
the parity predicate. We have shown in §10.1 that the solution
vector with the smallest coefficients can be written

$A = (2^{|R|}, \underbrace{2^{|R|-1}, \ldots, 2^{|R|-1}}_{|R| \text{ terms}}, \ldots, \underbrace{2^{j}, \ldots, 2^{j}}_{\binom{|R|}{j} \text{ terms}}, \ldots, 1).$

The length of this vector is given by

$|A| = \left( \sum_{j=0}^{|R|} \binom{|R|}{j} 2^{2j} \right)^{1/2} = 5^{|R|/2}.$




The corresponding unit vector is then

$A^* = 5^{-|R|/2} A.$

The analysis of §10.1 shows that $A \cdot \Phi$ is 1 or -1. Since $\Phi$ has
$2^{|R|}$ coefficients, each 0 or 1, we have $|\Phi| \le 2^{|R|/2}$, so that for
the unit vector $\hat\Phi = \Phi/|\Phi|$

$|A^* \cdot \hat\Phi| \ge \frac{1}{5^{|R|/2} \cdot 2^{|R|/2}} = \frac{1}{\sqrt{10^{|R|}}}.$

So we can take $1/\sqrt{10^{|R|}}$ as $\delta$. The number n of corrections is
then bounded by

$n \le \frac{1}{\delta^2} \le 10^{|R|}.$

We obtain a lower bound of $5^{|R|}$ for n by observing that $|A_n|$
must be at least $5^{|R|/2}$ and that

$|A_n|^2 \le n.$

Combining these we have

$5^{|R|} \le n \le 10^{|R|}.$

It is worth observing that if the convergence program had added
$\Phi$ instead of $\hat\Phi$ we would have obtained,

$\frac{5^{|R|}}{\max |\Phi|^2} \le n \le 10^{|R|}$, that is, $(5/2)^{|R|} \le n \le 10^{|R|}$.

More analysis would be necessary to decide whether this
modification would actually result in more rapid learning. In any case
it is clear that the learning time must increase exponentially as a
function of $|R|$.

These inequalities give bounds on the number n of corrections 
or, what comes to the same thing, of errors. A calculation of the 
total number of rounds of the program must take account of the 






decreasing error rate as learning proceeds. It is, however, easy to 
see that the number M(r) of rounds needed to reduce the propor- 
tion of errors to a fraction r should satisfy the inequality M(r) < 
n/r on the assumption that the figures are presented to the
machine in random order. Thus it should take something less than
$10^{|R|+2}$ rounds to achieve a 1 percent error rate.
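To get a feeling for the sizes involved, the bounds of this section
can be tabulated. The Python sketch below is only an illustrative
calculation; the function name `parity_learning_bounds` is
hypothetical. It also confirms the identity
$\sum_j \binom{|R|}{j} 4^j = 5^{|R|}$ that gives the squared length of the
solution vector.

```python
from math import comb

def parity_learning_bounds(r):
    """Bounds on the number n of corrections for a retina of size |R| = r."""
    length_sq = sum(comb(r, j) * 4 ** j for j in range(r + 1))  # |A|^2
    lower = length_sq                    # n >= |A|^2 = 5**r
    upper = 10 ** r                      # n <= 1/delta^2 = 10**r
    rounds_for_1_percent = 100 * upper   # M(0.01) <= n / 0.01
    return length_sq, lower, upper, rounds_for_1_percent

for r in (2, 4, 8):
    print(r, parity_learning_bounds(r))
```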


11.6 The Convergence Procedure as Hill-Climbing 

It is instructive to examine the relation of the convergence pro- 
cedure to the general problem of “hill-climbing.” There, too, one 
tries to find an apparently globally defined solution (that is, the 
location of the absolute summit) by local operations (for example, 
steepest ascent). Success depends, however, on the extent that the 
summit is not as globally defined as it might appear. In cases 
where the hill has a complex form with many local relative peaks, 
ridges, etc., hill-climbing procedures are not always advan- 
tageous. Indeed, in extreme cases a random or systematic search 
might be better than a procedure that relentlessly climbs every 
little hillock. 

In a typical hill-climbing situation one tries to maximize a
function G(A) of a point A in n-dimensional space. The simplest
procedure computes the value of the “altitude” function G for a
number of points $A_t + \Phi_i$ in the vicinity of the current point $A_t$.
On the basis of these experiments, a value $\Phi$ is chosen and $A_t + \Phi$
is taken as $A_{t+1}$. The algorithm for the choice of $\Phi$ varies. It
might, for example, use unit vectors in the directions of the axes
as the $\Phi_i$, compute the direction of steepest ascent and take $\Phi$ as
the unit vector in this direction. A simpler procedure might take
as $\Phi$ the first unit vector it finds with the property that
$G(A_t + \Phi) > G(A_t)$. The choice of the appropriate algorithm will depend
on many considerations. If, however, the hill (that is, the surface
defined by G) is sufficiently well behaved, any reasonably
sophisticated algorithm will work. If the hill is very bad, no ingenious
local tricks will do much good. See Figure 11.8.
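The simpler procedure, and the way a sharp ridge can defeat it, can be
sketched in Python. The altitude function `G` below is an assumed
example chosen for illustration (a ridge along x = y), and the names
`first_improving_step`, `AXIS_STEPS`, and `DIAGONAL_STEPS` are
hypothetical.

```python
def G(x, y):
    """Assumed altitude: a sharp ridge along x == y that rises with x + y."""
    return (x + y) - 10.0 * abs(x - y)

AXIS_STEPS = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
DIAGONAL_STEPS = [(1.0, 1.0), (-1.0, -1.0)]

def first_improving_step(point, steps):
    """Return the first neighbor with a larger G, or None if stuck."""
    x, y = point
    for dx, dy in steps:
        if G(x + dx, y + dy) > G(x, y):
            return (x + dx, y + dy)
    return None

on_ridge = (0.0, 0.0)
print(first_improving_step(on_ridge, AXIS_STEPS))      # axis moves all lose altitude
print(first_improving_step(on_ridge, DIAGONAL_STEPS))  # a move along the ridge gains
```

With axis-parallel test vectors every trial step falls off the ridge,
although the hill itself has no false summit.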

Now the perceptron convergence procedure can be seen as a
hill-climbing algorithm if we define the surface G by

$G(A) = \frac{A^* \cdot A}{|A|}.$

Figure 11.8 A good hill with a bad algorithm (example suggested by
Oliver Selfridge). Hill-climbing along the two axes won’t work, for if
A is a point on the ridge, both $G(A + \Phi_1)$ and $G(A + \Phi_2)$ are
less than G(A). The “resolution” of the test vectors is too coarse
for the sharpness of the ridge.


It differs from the usual form in two superficial respects. First,
the algorithm has no procedure for systematically exploring the
effects of moving in all directions from the current point $A_t$.
Second, it never actually has the value of the object function
G(A) since A* is, by definition, unknown.

Nevertheless, the logic of its operation is essentially like the
simpler of the two hill-climbing algorithms mentioned in the
previous paragraph: the step from $A_t$ to $A_{t+1} = A_t + \Phi$ is based
on evidence indicating (albeit indirectly) that $G(A_{t+1})$ is larger
than $G(A_t)$. One would expect its success to be related to the
shape of the surface G(A). And, indeed, a little thought will show
that this surface has none of the pathological features likely to
make hill-climbing difficult: there are no false local maxima, no
ridges, no plateaus, etc. This is most easily seen by considering
the function G(A) on the unit sphere, where $\hat A = A/|A|$. For $\hat A$
satisfying $\hat A \cdot A^* > 0$ (the only ones we need consider) this
surface is an n-dimensional cone. It has a single peak at $\hat A = A^*$,
connected uniform contours, straight lines of steepest ascent; in
short, all features a hill-climber could desire.

Thus, we see, from another point of view, that the convergence 
theorem is neither as surprising nor as isolated a phenomenon as 
it might at first appear. 

11.7 Perceptrons and Homeostats 

The significance of the perceptron convergence theorem must not 
be reduced (as it often has been, in the literature) to the mere 
statement: If two sets of figures are linearly separable then the
convergence theorem procedure can find a separating predicate. For if
all one wanted is to find a separating predicate, a more trivial
procedure would suffice.

Observe first that if there exists a vector A* such that
$A^* \cdot \Phi > \delta > 0$ for all $\Phi \in F$, then there exists a vector A' with
the same property and with integer components. We can therefore
find a suitable A' by the simple program:





start:     Set $A_0 = 0$.

test:      Choose $\Phi \in F$.
           If $A \cdot \Phi > 0$ go to test;
           otherwise go to generate.

generate:  Replace A by T(A) where T is any transformation
           such that the series T(0), T(T(0)), T(T(T(0))), ...,
           includes all possible integral vectors.
           Go to test.


Clearly, the procedure can make but a finite number of errors 
before it hits upon a solution. It would be hard to justify the 
term “learning” for a machine that so relentlessly ignores its 
experience. 
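A Python sketch of such an enumerative program follows. It is a
simplified illustration rather than the program above line for line:
it checks every member of F before moving on to the next candidate,
and the enumeration order (by increasing largest coordinate) is just
one assumed choice of T. The names `integer_vectors` and `homeostat`
are hypothetical.

```python
from itertools import count, product

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def integer_vectors(dim):
    """Yield every integer vector of the given dimension exactly once."""
    yield (0,) * dim
    for r in count(1):
        for v in product(range(-r, r + 1), repeat=dim):
            if max(abs(x) for x in v) == r:   # only the new "shell"
                yield v

def homeostat(F):
    """Try integer vectors in a fixed order, ignoring all experience."""
    rejected = 0
    for A in integer_vectors(len(F[0])):
        if all(dot(A, phi) > 0 for phi in F):
            return A, rejected
        rejected += 1                         # generate: move to T(A)

# Assumed example data; the integer vector (1, 0) is one solution.
F = [(1.0, 2.0), (2.0, 1.0), (1.0, -0.5)]
A, rejected = homeostat(F)
print(A, "found after rejecting", rejected, "candidates")
```

Nothing learned from a failure influences the next candidate, which
is exactly the objection raised in the text.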

The content of the perceptron convergence theorem must be that 
it yields a better learning procedure than this simple homeostat. 
Yet the problem of relative speeds of learning of perceptrons 
and other devices has been almost entirely neglected. There is not 
yet any general theory of this topic; in §11.5 we discussed some 
of the problems encountered in estimating learning times. Some 
other simple methods of “learning” will be discussed in Chapter 
12. The logical theory of homeostats, that is, enumerative pro- 
cedures like the one mentioned just above, is discussed by W. 
Ross Ashby in the book Design for a Brain. 


11.8 The Nonseparable Case 

There are many reasons for studying the operation of the
perceptron learning program when there is no A* with the property
$A^* \cdot \Phi > 0$ for all $\Phi \in F$. Some of these are practical reasons. For
example, one might want to use the program to test whether such 
an A* exists, or one might wish to make a learning machine of 
this sort and be worried about the possible effects of feedback 
errors and other “noise.” Other motives are theoretical. One can- 
not claim to have completely understood the “separable case” 
without at least some broader knowledge of other cases. 





Now it is perfectly obvious that Theorem 11.1 cannot be true, as 
it stands, under these more general conditions. It must be possible 
for A to change infinitely often. However, the fate of A is not 
obvious: will | A | increase indefinitely? Will A take infinitely 
many values or will it cycle through or in some other way remain 
in some fixed finite set? 

In the next sections we shall prove that | A | remains bounded. 
To be more precise we introduce the following definitions: Let F 
be a finite set of vectors. Then 

An F-chain is a sequence of vectors $A_1, A_2, \ldots, A_n$ for which

$A_{i+1} = A_i + \Phi_i,$
$\Phi_i \cdot A_i \le 0,$
$\Phi_i \in F.$

An F-chain is proper if, for all i,

$|A_i| \ge |A_1|.$

We will prove that F-chains beginning with large vectors cannot 
grow much larger. 


11.9 The Perceptron Cycling Theorem 

For any $\epsilon > 0$ there is a number $N = N(\epsilon, F)$ such that

if A, ..., A' is a proper F-chain and $|A| > N$,

then $|A'| < |A| + \epsilon$.


Corollary 1: The lengths | A | of vectors obtainable by executing 
the learning program, with a given F and a given initial vector, 
are bounded. If the finite set of vectors in F are constrained to 
have integer coordinates, then the process is finite-state! 


The plausibility of these assertions is easily verified by examining
Figure 11.10. As $|A|$ grows it becomes increasingly difficult to
find a member of F satisfying both $A \cdot \Phi \le 0$ and $|A + \Phi| > |A|$.
The formal proof is given in §11.10 and uses induction on the
dimension of the vectors in F.





Our proof of this theorem is complicated and obscure. So are the
other proofs we have since seen. Surely someone will find a simpler
approach. Non-specialists should go to §12.

The theorem (in the form of Corollary 1) was apparently first 
conjectured by Nilsson, and proved by Efron. Terry Beyer formu- 
lated the conjecture quite independently. 

11.10 Proof of the Cycling Theorem 

The proof depends on some observations about the effect, on the 
length of an arbitrarily long vector A, of adding to it a short 
vector C whose length is fixed. 

11.10.1 Lemmas* 

* We denote by $\hat A$ the unit-length vector along the direction of A.

If C is any vector, and A is very large compared to C, then
$|A + C| - |A| \approx \hat A \cdot C$.

More precisely, define $\Delta = |A + C| - |A|$. Then for any $\epsilon > 0$,
if we choose $|A| > |C|^2/\epsilon$ then the difference between $\Delta$ and
$\hat A \cdot C$ will be less than $\epsilon$.

It is easy to read from the infinitesimal geometry of Figure 11.9
that $|\hat A \cdot C - \Delta| < |B| \sin\theta \approx |B|^2/|A| < |C|^2/|A|$, when
$|A| \gg |C|$.

A formal proof is hardly necessary, but if we define $x = |A + C|$ and
$y = |A|$ we can use the identity

$x^2 - y^2 = 2y(x - y) + (x - y)^2$ to obtain
$2A \cdot C + |C|^2 = 2|A|\Delta + \Delta^2$; hence
$2|A|(\hat A \cdot C - \Delta) = \Delta^2 - |C|^2$. Since $|\Delta| \le |C|$ we then have
$|\hat A \cdot C - \Delta| \le |C|^2/|A|$.

Since this shows that $\Delta \approx \hat A \cdot C$ when $|A| \gg |C|$ we can conclude that

Lemma 1: We can make $\Delta$ as small as we want by setting a lower
bound on $|A|$ and an upper bound on $\hat A \cdot C$, that is, by taking
A large, and nearly orthogonal to C.

Lemma 2: We can make the angle (A, A + C) as small as we like
by increasing $|A|$ because $\sin\theta \le |C|/|A|$.

Lemma 3: If a relatively small vector C is bounded away from
orthogonal to a very large vector A, with negative inner product,
then the $\Delta$ is bounded away from (negative) zero. In fact, if
$\hat A \cdot \hat C < -\delta < 0$, then if we take $|A| > (2/\delta)|C|$ we have
(because $\Delta$ approaches the negative quantity $\hat A \cdot \hat C\,|C|$),

$\hat A \cdot \hat C\,|C| < \Delta < \tfrac{1}{2}\,\hat A \cdot \hat C\,|C| < 0.$

Thus $\Delta < -\tfrac{1}{2}\,\delta\,|C|$.
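The estimate behind Lemmas 1 and 3 is easy to test numerically. The
Python sketch below is an illustration only; the random test vectors
and the name `growth_and_estimate` are assumptions. It compares
$\Delta = |A + C| - |A|$ with $\hat A \cdot C$ and reports how large the
discrepancy gets relative to $|C|^2/|A|$.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    return math.sqrt(dot(v, v))

def growth_and_estimate(A, C):
    """Return (Delta, A_hat . C) where Delta = |A + C| - |A|."""
    total = [a + c for a, c in zip(A, C)]
    return norm(total) - norm(A), dot(A, C) / norm(A)

random.seed(1)
worst_ratio = 0.0
for _ in range(1000):
    A = [random.uniform(-100, 100) for _ in range(4)]   # a long vector
    C = [random.uniform(-1, 1) for _ in range(4)]       # a short vector
    delta_len, estimate = growth_and_estimate(A, C)
    bound = dot(C, C) / norm(A)                         # |C|^2 / |A|
    worst_ratio = max(worst_ratio, abs(estimate - delta_len) / bound)

print("largest |A_hat.C - Delta| relative to |C|^2/|A|:", worst_ratio)
```

The identity in the formal proof in fact gives the sharper constant
1/2, so the printed ratio should not exceed one half apart from
rounding.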



Figure 11.10

Finally, we need one more substantial Lemma: 

Lemma 4: The projection of a proper F-chain $A_1, \ldots, A_k$ onto a
hyperplane containing F is a proper F-chain. Moreover, the increase
in length, $|A_k| - |A_1|$, is not greater than that of the projected
chain.

proof: Let $A_1, \ldots, A_k$ be a proper chain. Let H be a hyperplane
containing F, and $\hat B$ the normal to H. Remember that $\hat B \cdot \Phi = 0$
for all $\Phi \in F$. Write

$A_i = \bar A_i + x_i \hat B,$

where $\bar A_i$ denotes the projection of $A_i$ onto H. To show
$\bar A_1, \ldots, \bar A_k$ is an F-chain, let
$A_{i+1} = A_i + \Phi$ where $A_i \cdot \Phi \le 0$. Now

$A_{i+1} = \bar A_{i+1} + x_{i+1} \hat B$
$= (\bar A_i + \Phi) + x_i \hat B.$

Then, by the orthogonality of $\hat B$ to all of $\bar A_i$, $\bar A_{i+1}$, and $\Phi$,

$x_{i+1} = x_i$ and $\bar A_{i+1} = \bar A_i + \Phi.$

Finally, putting $B = x_1 \hat B$ we get

$0 \ge A_i \cdot \Phi = (\bar A_i + B) \cdot \Phi = \bar A_i \cdot \Phi + B \cdot \Phi = \bar A_i \cdot \Phi.$

To show that the $\bar A$’s form a proper F-chain we must also verify
that $|\bar A_i| \ge |\bar A_1|$. This follows immediately from

$|A_i|^2 = |\bar A_i|^2 + 2\bar A_i \cdot B + |B|^2$
$= |\bar A_i|^2 + |B|^2.$

Finally it is easy to see that

$|A_k| - |A_1| = \sqrt{|\bar A_k|^2 + |B|^2} - \sqrt{|\bar A_1|^2 + |B|^2} \le |\bar A_k| - |\bar A_1|$

so the latter must be positive. Q.E.D.

11.10.2 Proof of Cycling Theorem 

We prove the theorem by induction on the dimension of the 
vector space. 

base: The theorem is obviously true in $E_1$, the one-dimensional
case, for there the vectors are simply real numbers and $\Phi \cdot A < 0$
means that $\Phi$ and A have opposite signs. If $|A| > \max_F |\Phi|$ then
$|\Phi + A| < |A|$ for $\Phi \cdot A < 0$. So eventually $|A_i|$ must be less
than $\max_F |\Phi|$.

induction: Assume for an inductive proof that the theorem is
true in $E_{n-1}$. Note that this implies the existence of a bound
$M_{n-1}$ such that any F-chain $A_1, \ldots, A_m$ in $E_{n-1}$ can grow at most
$M_{n-1}$ in length, that is, $|A_m| < |A_1| + M_{n-1}$.

Choose any direction (that is, unit vector) $\hat A$ in $E_n$. Our first
subgoal will be to construct an open neighborhood $V(\hat A)$ on the
unit sphere from which the growth of chains is bounded; in fact,
for any $\epsilon > 0$, there is a bound $N(\hat A)$ such that, if $|B| > N(\hat A)$
and $\hat B \in V(\hat A)$ then any proper F-chain starting at B can grow at
most $\epsilon$ in length.

Since the open sets $V(\hat A)$ cover the surface of the unit sphere,
and since the sphere is compact, it will follow that we can find one
N that will work in place of all the $N(\hat A)$’s and the theorem
will be proved.

Let $H(\hat A)$ be the hyperplane orthogonal to $\hat A$ and let $\bar H(\hat A)$ be the
complement of $H(\hat A)$, that is, $\bar H(\hat A) = E_n - H(\hat A)$.

Since F is finite, there is a number $\delta > 0$ such that $|\hat\Phi \cdot \hat A| > 2\delta$
for all $\Phi$ in $\bar H(\hat A) \cap F$. By continuity there is a neighborhood
$V'(\hat A)$ such that if $\hat B \in V'(\hat A)$ and $\Phi \in \bar H(\hat A) \cap F$ then
$|\hat\Phi \cdot \hat B| > \delta$.

There is also a number b such that $|\Phi| < b$ for all $\Phi \in F$. We can
now deduce from Lemma 3 that there are numbers $\delta'$ and $n(\hat A)$
such that

if   (1) $|B| > n(\hat A)$
     (2) $\Phi \in \bar H(\hat A) \cap F$
     (3) $\hat B \in V'(\hat A)$
     (4) $\Phi \cdot B < 0$

then (5) $|B + \Phi| < |B| - \delta'$.

These are the conditions of Lemma 3, where B and $\Phi$ play the role
of A and C of the lemma. Note that (2) keeps $\Phi$ from being
perpendicular to $\hat A$ and (4) keeps $\Phi$ from being perpendicular to B.


We shall consider a proper F-chain,

$B_1, \ldots, B_j, \ldots$ with $B_{j+1} = B_j + \Phi_j$

with $\hat B_1$ very near $\hat A$ and $|B_1| > n(\hat A)$. In particular, let $\eta > 0$ be a
number such that the diameter of $V'(\hat A)$ is bigger than $\eta$. Let $V(\hat A)$
be a neighborhood of $\hat A$ on the unit sphere such that the diameter
of $V(\hat A)$ is less than $\eta/2$. So $V(\hat A) \subset V'(\hat A)$. We now take $B_1$
such that $\hat B_1 \in V(\hat A)$ and $|B_1| > n(\hat A)$ though we will shortly
change this lower bound on the magnitude of $B_1$ to the desired
$N(\hat A)$.

By (5) above, the chain cannot be proper unless $\Phi_1 \in H(\hat A)$. Thus
the chain must start growing from $H(\hat A)$. We will see that not only
$\Phi_1$, but all the other $\Phi$’s must be in $H(\hat A)$; hence the chain’s
growth is bounded by $\epsilon$. For suppose that

$\Phi_1, \ldots, \Phi_j \subset H(\hat A)$ and $\Phi_{j+1} \in \bar H(\hat A)$.

Then $|B_{j+1}|$ will be less than $|B_1|$ by at least $\delta'/2$. To see this we
use Lemmas 1 and 2 and the inductive assumption. Since the
projections $\bar B_1, \ldots, \bar B_j$ of $B_1, \ldots, B_j$ form a proper F-chain in the
(n - 1)-dimensional space $H(\hat A)$,

$|\bar B_j| < |\bar B_1| + M_{n-1},$

where $M_{n-1}$ is the bound obtained by the induction assumption
for the next lower dimension. Now, if $\eta$ is chosen small enough
and if $N(\hat A)$ is chosen large enough the conditions of Lemmas 1
and 2 are satisfied with the following values: we use $\Phi_1 + \cdots + \Phi_j$
for “C,” so that** $|C| < M_{n-1}$; we use $B_1$ for “A”; and we use
a smaller $\delta' = \min(\epsilon, \delta'/2)$ for “$\epsilon$.”

It follows from (5) and the fact that $|B_j| \ge |B_1| > N(\hat A)$ that
$|B_{j+1}| < |B_j| - \delta' < |B_1| - \delta'/2$ so that the jump from $B_j$ to
$B_{j+1}$ decreased the length of the B-vector more than the first j
steps increased it! Thus the chain cannot be proper unless all
the $\Phi$’s belong to $H(\hat A)$. But in this case the increase in length
of the whole chain is bounded by $\epsilon$. This achieves the first
subgoal. The surface of the sphere is covered by the $V(\hat A)$’s. By
compactness, there is a finite subcovering. Let N be the maximum of
the corresponding $N(\hat A)$’s. It follows that for any proper chain
B, ..., B',

$|B| > N \Rightarrow |B'| < |B| + \epsilon.$

This completes the proof of the cycling theorem.

** There is a gap here, pointed out quite correctly by
H. D. Block and S. A. Levin, Proc. Amer. Math. Soc. 26,
pp. 229-235.



Linear Separation and Learning 

12 


12.0 Introduction 

The perceptron and the convergence theorems of Chapter 11 are
related to many other procedures that are studied in an extensive
and disorderly literature under such titles as LEARNING MACHINES,
MODELS OF LEARNING, INFORMATION RETRIEVAL, STATISTICAL DECISION
THEORY, PATTERN RECOGNITION and many more. In this
chapter we will study a few of these to indicate points of contact 
with the perceptron and to reveal deep differences. We can give 
neither a fully rigorous account nor a unifying theory of these 
topics: this would go as far beyond our knowledge as beyond the 
scope of this book. The chapter is written more in the spirit of 
inciting students to research than of offering solutions to prob- 
lems. 

12.1 Information Retrieval and Inductive Inference 

The perceptron training procedures (Chapter 11) could be used to
construct a device that operates within the following pattern of 
behavior: 





During a “filing” phase, the machine is shown a “data set” of
n-dimensional vectors (one can think of them as n-bit binary
numbers or as points in n-space). Later, in a “finding” phase, the
machine must be able to decide which of a variety of “query”
vectors belong to the data set. To generalize this pattern we will
use the term $A_{file}$ for an algorithm that examines elements of a data
set to modify the information in a memory. $A_{file}$ is designed to
prepare the memory for use by another procedure, $A_{find}$, which will
use the information in the memory to make decisions about query
vectors.

This chapter will survey a variety of instances of this general 
scheme. We will begin by relating the perceptron procedure to 
the simplest such scheme: in the complete storage procedure
$A_{file}$ merely copies the data vectors, as they come, into the
memory. For each query vector, $A_{find}$ searches exhaustively
through memory to see if it is recorded there.

12.1.1 Comparing perceptron with complete storage 
Our purpose is to illustrate, in this simple case, some of the ques- 
tions one might ask to compare retrieval schemes: 

Is the procedure universal? The perceptron scheme works per- 
fectly only under the restriction that the data set is linearly 
separable. Complete storage is universal: it works for any 
data set. 

How much memory is required? Complete storage needs a
memory large enough to hold the full data set. Perceptron, when it
is applicable, sometimes has a summarizing effect, in that the
information capacity needed to store its coefficients $\{\alpha_i\}$ is
substantially less than that needed to store the whole data set. We
have seen (§10.2) that this is not generally true; the coefficients
for $\psi_{PARITY}$ may need much more storage than does the list of
accepted vectors.

How quickly does $A_{find}$ operate? The retrieval scheme, exhaustive
search, specified for complete storage is very slow (usually
slower than perceptron’s $A_{find}$, which must also retrieve all its
coefficients from memory). On the other hand, very similar
procedures could be much faster. For example, if $A_{file}$ did not just
store the data set in its order of entry, but sorted the memory
into numerical order, then $A_{find}$ could use a binary search,
reducing the query-answer time to

$\log_2(|\text{data set}|)$

memory references. We shall study (in §12.6) $A_{file}$ algorithms that
sacrifice memory size to obtain dramatic further increases in
speed (by the so-called hash-coding technique).
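The sorted variant of complete storage can be sketched in a few lines
of Python. This is an illustration under assumptions: data vectors are
taken to be n-bit integers, and the function names `a_file` and
`a_find` are hypothetical stand-ins for the two procedures.

```python
from bisect import bisect_left

def a_file(data_set):
    """Filing phase: store the data set in numerical order."""
    return sorted(set(data_set))

def a_find(memory, query):
    """Finding phase: binary search, about log2(len(memory)) probes."""
    i = bisect_left(memory, query)
    return i < len(memory) and memory[i] == query

memory = a_file([0b1011, 0b0001, 0b1110, 0b0110])
print(a_find(memory, 0b1110), a_find(memory, 0b0111))
```

Sorting moves work from the finding phase into the filing phase; the
memory required is unchanged.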

Can the procedure operate with some degree (perhaps measured
probabilistically) of success even when $A_{file}$ has seen only a subset
of the data set (call it a “data sample”)? Perceptron might, but
the complete storage algorithm, as described, cannot make a
reasonable guess when presented with a query vector not in the
data sample. This deficiency suggests an important modification
of the complete storage procedure: let $A_{find}$, instead of merely
checking whether the query vector is in the data sample, find that
member of the data sample closest to it. This would lead, on an
a priori assumption about the “continuity” of the data set, to a
degree of generalization as good as the perceptron’s.
Unfortunately the speed-up procedures such as hash-coding cease to be
available and we conjecture (in a sense to be made more precise
in §12.7.6) that the loss is irremediable.

Other considerations we shall mention concern the operation of
$A_{file}$. We note that the perceptron and the complete storage
procedures share the following features:

They act incrementally, that is, change the stored memory slightly 
as a function of the currently presented member of the data set. 

They operate in “real time” without using large quantities of 
auxiliary scratch-pad memory. 

They can accept the data set in any order and are tolerant of 
repetitions that cause only delay but do not change the final 
state. 

On the other hand they differ in at least one very fundamental 
way: 

The perceptron’s $A_{file}$ is a “search procedure” based on feedback
from its own results. The complete storage file algorithm is
passive. The advantage for the perceptron is that under some
conditions it finds an economical summarizing representation. The cost
is that it may need to see each data point many times.

12.1.2 Multiple Classification Procedures 

It is a slight generalization of these ideas to suppose the data
set divided into a number of classes $F_1, \ldots, F_k$. The algorithm
$A_{file}$ is presented as before with members of the data set but also
with indications of the corresponding class. It constructs a body
of stored information which is handed over to $A_{find}$ whose task
is to assign query points to their classes using this information.

Example: We have seen (§11.4.1) how to extend the concept of
the perceptron to a multiple classification. The training algorithm,
$A_{file}$, finds k vectors $A_1, \ldots, A_k$, and $A_{find}$ assigns the vector $\Phi$
to $F_j$ if

$\Phi \cdot A_j > \Phi \cdot A_i$ (all $i \ne j$). INNER PRODUCT

Example: The following situation will seem much more familiar
to many readers. If we think of each class $F_j$ as a “clump”
or “cloud” or “cluster” of points in $\Phi$-space, then we can imagine
that with each $F_j$ is associated a special point $B_j$ that is, somehow,
a “typical” or “average” point. For example, $B_j$ could be the
center of gravity, that is, the mean of all the vectors in $F_j$ (or,
say, of just those vectors that so far have been observed to be in
$F_j$). Then a familiar procedure is: $\Phi$ is judged to be in that
$F_j$ for which the Euclidean distance

$|\Phi - B_j|$

is the smallest. That is, each $\Phi$ is identified with the nearest
B-point.

Now this nearness scheme and the inner-product scheme
might look quite different, but they are essentially the same!
For we have only to observe that the set of points closer to
one given point $B_1$ than to another $B_2$ is bounded by a
hyperplane (Figure 12.1), and hence can be defined by a linear
inequality. Similarly, the points closest to one of a number of $B_j$’s
form a (convex) polygon (Figure 12.2) and this is true in higher
dimensions, also.

Figure 12.1



Formally, we see this equivalence by observing that

$|\Phi - B_j|^2 = |\Phi|^2 - 2\Phi \cdot B_j + |B_j|^2.$

Now, if we can assume that all the $\Phi$’s have the same length L then
the Euclidean distance $|\Phi - B_j|$ will be smallest when

$\Phi \cdot B_j - \tfrac{1}{2}|B_j|^2 = \Phi \cdot B_j - \theta_j$

is largest. But this is exactly the inner-product, if the “threshold”
is removed by §1.2.1 (1). To see that the inner-product concept loses
nothing by requiring the $\Phi$’s to have the same length, we add an extra
dimension and replace each $\Phi = [\varphi_1, \ldots, \varphi_n]$ by

$\Phi' = \left[\varphi_1, \ldots, \varphi_n, \sqrt{n - \sum_i \varphi_i^2}\,\right]$

so that all have length $L^2 = n$. We have to add one dimension to
the B’s, too, but can always set its coefficient to zero.
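The equivalence can be checked directly. In the Python sketch below,
which is an illustration with assumed random data and hypothetical
function names, one rule picks the nearest B-point and the other picks
the largest value of $\Phi \cdot B_j - \tfrac{1}{2}|B_j|^2$; the two are then
compared on a set of query vectors.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def nearest_center(phi, centers):
    """Index j that minimizes the squared distance |phi - B_j|^2."""
    return min(range(len(centers)),
               key=lambda j: sum((p - b) ** 2 for p, b in zip(phi, centers[j])))

def largest_inner_product(phi, centers):
    """Index j that maximizes phi . B_j - |B_j|^2 / 2 (threshold theta_j)."""
    return max(range(len(centers)),
               key=lambda j: dot(phi, centers[j]) - 0.5 * dot(centers[j], centers[j]))

random.seed(2)
centers = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
queries = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
agree = all(nearest_center(q, centers) == largest_inner_product(q, centers)
            for q in queries)
print("rules agree on every query:", agree)
```

For a fixed query the term $|\Phi|^2$ is the same for every j, which is
why the comparison between classes reduces to a linear one.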

12.2 A Variety of Classification Algorithms 

We select, from the infinite variety of schemes that one might 
use to divide a space into different classes, a few schemes that 
illustrate aspects of our main subject: computation and linear 
separation. We will summarize each very briefly here; the re- 
mainder of the chapter compares and contrasts some aspects of 
their algorithmic structures, memory requirements, and commit- 
ments they make about the nature of the classes. 





Each of our models uses the same basic form of decision
algorithm for $A_{find}$. In each case there is assigned to each class $F_j$
one or more vectors $A_i$; we will represent this assignment by
saying that $A_i$ is associated with $F_{j(i)}$. Given a vector $\Phi$, the decision
rule is always to choose that $F_{j(i)}$ for which $A_i \cdot \Phi$ is largest. As
noted in §12.1.2 this is mathematically equivalent to a rule that
minimizes $|\Phi - A_i|$ or some very similar formula.

For each model we must also describe the algorithm $A_{file}$ that
constructs the $A_i$’s, on the basis of prior experience, or a priori
information about the classes. In the brief vignettes below, the
fine details of the $A_{file}$ procedures are deferred to other sections.

12.2.1 The perceptron Procedure 

There is one vector $A_j$ for each class $F_j$. $A_{\text{file}}$ can be the
procedures described in §11.1 for the 2-class case and in §11.4.1 for
the multi-class case.


12.2.2 The bayes Linear Statistical Procedure 

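Again we have one $A_j$ for each $F_j$. $A_{\text{file}}$ is quite different,
however. For each class $F_j$ and each partial predicate $\varphi_i$, define

$$w_{ij} = \log\frac{p_{ij}}{1 - p_{ij}}, \qquad \theta_j = \log P_j + \sum_i \log(1 - p_{ij}),$$

where $p_{ij}$ is the probability that $\varphi_i = 1$, given that $\Phi$ is in $F_j$
(and $P_j$ is the a priori probability of $F_j$, as in §12.4.3). Then
define

$$A_j = (\theta_j, w_{1j}, w_{2j}, \ldots).$$

We will explain in §12.4.3 the conditions under which this use
of “probability” makes sense, and describe some “learning” algorithms
that could be used to estimate or approximate the $w_{ij}$'s.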

The bayes procedure has the advantage that, provided certain
statistical conditions are satisfied, it gives good results for classes
that are not linearly separable. In fact it gives the lowest possible
error rate for procedures in which $A_{\text{file}}$ depends only on
conditional probabilities, given that the $\varphi$'s are statistically
independent in the sense explained in §12.4.2. It is astounding that
this is achieved by a linear formula.





12.2.3 The best planes Procedure 

In different situations either perceptron or bayes may be
superior. But often, when the $F_j$'s are not linearly separable, there
will exist a set of $A_j$ vectors which will give fewer errors than
either of these schemes. So define the best planes procedure to
use that set of $A_j$'s for which choice of the largest $A_j \cdot \Phi$ gives
the fewest errors.

By definition, best planes is always at least as good as bayes
or perceptron. This does not contradict the optimality of bayes
since the search for the best plane uses information other than
the conditional probabilities. Unfortunately no practically efficient
$A_{\text{file}}$ is known for discovering its $A_j$'s. As noted in §12.3,
hill-climbing will apparently not work well because of the local
peak problem.

12.2.4 The isodata Procedure 

In the schemes described up to now, we assigned exactly one A-vector
to each F-class. If we shift toward the minimum-distance
viewpoint, this suggests that such procedures will work satisfactorily
only when the F-classes are “localized” into relatively isolated,
single regions; one might think of clumps, clusters, or
clouds. Given this intuitive picture, one naturally asks what to do
if an F-class, while not a neat spherical cluster, is nevertheless
semilocalized as a small number of clusters or, perhaps, a snakelike
structure. We could still handle such situations, using the
least-distance $A_{\text{find}}$, by assigning an A-vector to each subcluster
of each F, and using several A's to outline the spine of the snake.
To realize this concept, we need an $A_{\text{file}}$ scheme that has some
ability to analyze distributions into clusters. We will describe one
such scheme, called isodata, in §12.5.

12.2.5 The nearest neighbor Procedure 

Our simplest and most radical scheme assumes no limit on the
number of A-vectors. $A_{\text{file}}$ stores in memory every $\Phi$ that has
ever been encountered, together with the name of its associated
F-class. Given a query vector $\Phi_0$, we find which $\Phi$ in the memory is
closest to $\Phi_0$ and choose the F-class associated with that $\Phi$.

This is a very generally powerful method: it is very efficient on
many sorts of cluster configurations; it never makes a mistake on
an already seen point; in the limit it approaches zero error
except under rather peculiar circumstances (one of which is discussed
in the following section).

nearest neighbor has an obvious disadvantage (the very large
memory required) and a subtle disadvantage: there is reason to
suspect that it entails large, and fundamentally unavoidable,
computation costs (discussed in §12.6).
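The procedure is short enough to state as a program. This is only an illustrative sketch: the class name `NearestNeighbor` and its methods `file` and `find` are hypothetical names chosen to mirror $A_{\text{file}}$ and $A_{\text{find}}$, and plain Euclidean distance is assumed as the metric.

```python
class NearestNeighbor:
    """Store every labelled vector; answer queries by the closest stored one."""

    def __init__(self):
        self.memory = []  # list of (phi, class_name) pairs, in order of arrival

    def file(self, phi, class_name):
        """A_file: record the vector together with the name of its class."""
        self.memory.append((tuple(phi), class_name))

    def find(self, phi0):
        """A_find: return the class of the stored vector closest to phi0."""
        def squared_distance(item):
            stored_phi, _ = item
            return sum((a - b) ** 2 for a, b in zip(stored_phi, phi0))
        return min(self.memory, key=squared_distance)[1]

nn = NearestNeighbor()
nn.file((0, 0), "F-")
nn.file((1, 1), "F+")
print(nn.find((1, 0.9)))
```

Every query scans the whole memory, which is the computation cost the text warns about; the sketch makes no attempt to avoid it.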

12.3 Heuristic Geometry of Linear Separation Methods 

The diagrams of this section attempt to suggest some of the be- 
havioral aspects of the methods described in §12.4. To compen- 
sate for our inability to depict multidimensional configurations, 
we use two-dimensional multivalued coordinates. The diagrams 
may appear plausible, but they are really defective images that 
do not hint at the horrible things that can happen in a space of 
many dimensions. 

Using this metaphorical kind of picture, we can suggest two kinds 
of situations which tend to favor one or the other of bayes or 
perceptron (see Figure 12.3). 




The bayes line in Figure 12.3 tends to lie perpendicular to the
line between the “mean” points of $F_-$ and $F_+$. Hence in Figure
12.3(a), we find that bayes makes some errors. The sets are, in
fact, linearly separable, hence perceptron, eventually, makes no
errors at all. In Figure 12.3(b) we find that bayes makes a few
errors, just as in 12.3(a). We don’t know much about perceptron
in nonseparable situations; it is clear that in some situations it
will not do as well as bayes. By definition best plane, of course,
does at least as well as either bayes or perceptron.

From the start, the very suggestion that any of these procedures
will be any good at all amounts to an a priori proposal that the
F-classes can be fitted into simple clouds of some kind, perhaps
with a little overlapping, as in Figure 12.4. Such an assumption
could be justified by some reason for believing that the differences
between $F_+$ and $F_-$ are due to some one major influence plus
a variety of smaller, secondary effects. In general perceptron
tends to be sensitive to the outer boundaries of the clouds, and
relatively insensitive to the density distributions inside, while
bayes weights all $\Phi$'s equally. In cases that do not satisfy either
the single-cloud or the slight-overlap condition (Figure 12.5), we
can expect bayes to do badly, and presumably perceptron also.
best plane can be substantially better because it is not subject
to the bad influence of symmetry. But finding the best plane is
likely to involve bad computation problems because of multiple,
locally optimal “hills.” Figure 12.6 shows some of the local peaks
for best plane in the case of a bad “paritylike” situation. Here,
even isodata will do badly unless it is allowed to have one A-vector
for nearly every clump. But in the case of a moderate
number of clumps, with an $A_i$ in each, isodata should do quite
well. (See §12.5.) Generally, we would expect perceptron to be
slightly better than bayes because it exploits behavioral feedback,
but worse because of undue sensitivity to isolated errors.

Figure 12.5



Figure 12.6 


One would expect the nearest neighbor procedure to do well
under a very wide range of conditions. Indeed, nearest
neighbor, in the limiting case of recording all $\Phi$'s with their class
names, will do at least as well as any other procedure. There are
conditions, though, in which nearest neighbor does not do so
well until the sample size is nearly the whole space. Consider,
for example, a space in which there are two regions:

[Figure: a space divided into two regions; in the lower region $P(\Phi \in F_+) = 1 - p = q$.]






In the upper region a fraction $p$ of the points are in $F_+$, and these
are randomly distributed in space, and similarly for $F_-$ in the
lower region. Then if a small fraction of the points are already
recorded, the probability that a randomly selected point has the
same F as its nearest recorded neighbor is

$$p^2 + q^2 = 1 - 2pq,$$

while the probability of correct identification by bayes or by
best plane is simply $p$. Assuming that $p > \tfrac{1}{2}$ (if not, just
exchange $p$ and $q$) we see that

$$\text{Error}_{\text{BEST PLANE}} < \text{Error}_{\text{NEAREST NEIGHBOR}} < 2 \times \text{Error}_{\text{BEST PLANE}}$$

so that nearest neighbor is worse than best plane, but not
arbitrarily worse. This effect will remain until so many points
have been sampled that there is a substantial chance that the
sampled point has been sampled before, that is, until a good
fraction of the whole space is covered.
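The two-sided bound follows from simple arithmetic on the error rates, which the sketch below spells out. The function name `error_rates` is hypothetical; the error of best plane is taken to be $q = 1 - p$ and the error of nearest neighbor to be $1 - (p^2 + q^2) = 2pq$, as in the passage above.

```python
def error_rates(p):
    """Return (best-plane error, nearest-neighbor error) for mixing fraction p."""
    q = 1 - p
    best_plane_error = q                # the plane is right with probability p
    nearest_neighbor_error = 2 * p * q  # 1 - (p^2 + q^2)
    return best_plane_error, nearest_neighbor_error

for p in (0.6, 0.75, 0.9):
    best, nn = error_rates(p)
    # For 1/2 < p < 1: q < 2pq because 2p > 1, and 2pq < 2q because p < 1.
    print(p, best, nn, best < nn < 2 * best)
```

The inequalities become equalities at the endpoints $p = \tfrac{1}{2}$ and $p = 1$, so the strict form holds only strictly between them.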

On the other side, to the extent that the “mixing” of $F_+$ and
$F_-$ is less severe (see Figure 12.7), the nearest neighbor will
converge to very good scores as soon as there is a substantial
chance of finding one sampled point in most of the microclumps.



Figure 12.7 

A very bad case is a paritylike structure in which nearest
neighbor actually does worse than chance. Suppose that $\Phi \in F_1$ if
and only if $\varphi_i = 1$ for an even number of $i$'s. Then, if there
are $n$ $\varphi$'s, each $\Phi$ will have exactly $n$ neighbors whose distance $d$
satisfies $0 < d \le 1$. Each of these neighbors differs from $\Phi$ in
exactly one $\varphi_i$ and therefore lies in the opposite class. Suppose
that all but a fraction $q$ of all possible $\Phi$'s have already been seen.
Then nearest neighbor will err on a given $\Phi$ if it has not been seen
(probability $= q$) but one of its immediate neighbors has been seen
(probability $= 1 - q^n$). So the probability of error is
$\ge q(1 - q^n)$, which, for large $n$, can be quite near certainty.
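A small exhaustive check makes the argument concrete. This is a sketch under the assumptions that the $\varphi$'s take values 0 and 1 and that "immediate neighbors" means vectors differing in exactly one coordinate; the function names are hypothetical.

```python
from itertools import product

def parity_class(phi):
    """Class 1 if an even number of coordinates equal 1, otherwise class 2."""
    return 1 if sum(phi) % 2 == 0 else 2

def immediate_neighbors(phi):
    """All vectors that differ from phi in exactly one coordinate."""
    return [phi[:i] + (1 - phi[i],) + phi[i + 1:] for i in range(len(phi))]

def error_lower_bound(q, n):
    """q * (1 - q**n): the point is unseen, and some immediate neighbor is seen."""
    return q * (1 - q ** n)

# Every immediate neighbor lies in the opposite class (checked here for n = 5).
opposite = all(parity_class(nb) != parity_class(phi)
               for phi in product((0, 1), repeat=5)
               for nb in immediate_neighbors(phi))
print(opposite, error_lower_bound(0.5, 2), error_lower_bound(0.9, 100))
```

As $n$ grows the factor $1 - q^n$ tends to 1, so the bound approaches $q$: nearly every unseen point is misclassified.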

This example is, of course, “pathological,” as mathematicians like 
to say, and nearest neighbor is probably good in many real 
situations. Its performance depends, of course, on the precise 
“metric” used to compute distance, and much of classical statis- 
tical technique is concerned with optimizing coordinate axes and 
measurement scales for applications of nearest neighbor. 

Finally, we remark that because the memory and computation 
costs for this procedure are so high, it is subject to competition 
from more elaborate schemes outside the regime of linear separa- 
tion — and hence outside the scope of this book. 

12.4 Decisions Based on Probabilities of Predicate- Values 

Some of the procedures discussed in previous sections might be 
called “statistical” in the weak sense that their success is not 
guaranteed except up to some probability. The procedures dis- 
cussed in this section are statistical also in the firmer sense that 
they do not store members of the data set directly, but instead 
store statistical parameters, or measurements, about the data set. 
We shall analyze in detail a system that computes (or estimates)
the conditional probabilities $p_{ij}$ that, for each class $F_j$, the predicate
$\varphi_i$ has the value 1. It stores these $p_{ij}$'s together with the absolute
probabilities $P_j$ of $\Phi$ being in each of the $F_j$'s.

Given an observed $\Phi$, the decision to choose an $F_j$ is a typical
statistical problem usually solved by a “maximum likelihood” or
Bayes-rule method. It is interesting that procedures of this kind
resemble very closely the perceptron separation methods. In fact,
when we can assume that the conditional probabilities $p_{ij}$ are
suitably independent (§12.4.2) it turns out that the best procedure is
the linear threshold decision we called bayes in §12.2.2. We now
show how this comes about.

12.4.1 Maximum Likelihood and Bayes Law 

In Chapter 11 we assumed that each $\Phi$ is associated with a unique
$F_j$. We now consider the slightly more general case in which the
same $\Phi$ could be produced by events in several different F-classes.
Then, given an observed $\Phi$ we cannot in general be sure which
$F_j$ is responsible, but we can at best know the associated
probabilities.

Suppose that a particular $\Phi_0$ has occurred and we want to know
which F is most likely. Now if $F_j$ is responsible, then the “joint
event” $F_j \wedge \Phi_0$ has occurred; this has (by definition) probability
$P(F_j \wedge \Phi_0)$. Now (by definition of conditional probability) we can
write

$$P(F_j \wedge \Phi_0) = P(F_j) \cdot P(\Phi_0 \mid F_j). \tag{1}$$

That is, the probability that both $F_j$ and $\Phi_0$ will happen together is
equal to the probability that $F_j$ will occur multiplied by the
probability that if $F_j$ occurs so will $\Phi_0$.

We should choose that $F_j$ which gives Formula 1 the largest value
because that choice corresponds to the most likely of those events
that could have occurred:

$$F_1 \wedge \Phi_0, \quad F_2 \wedge \Phi_0, \quad \ldots, \quad F_k \wedge \Phi_0.$$

There are serious practical obstacles to the direct use of Formula 1.
If there are many different $\Phi$'s it becomes impractical to store all
the decisions in memory, let alone to estimate them all on the
basis of empirical observation. Nor has the system any ability
to guess about $\Phi$'s it has not seen before. We can escape all these
difficulties by making one critical assumption (in effect, assuming
the situation closely fits a certain model): that the partial
predicates of $\Phi = (\varphi_1, \ldots, \varphi_m)$ are suitably independent.

12.4.2 Independence 

Up to now we have suppressed the $X$'s of earlier chapters because we did
not care where the values of the $\varphi$'s came from. We bring them back for
a moment so that we can give a natural context to the independence
hypothesis.

We can evade the problems mentioned above if we can assume
that the tests $\varphi_i(X)$ are statistically independent over each F-class.
Precisely, this means that for any $\Phi(X) = (\varphi_1(X), \ldots, \varphi_m(X))$ we
can assert that, for each $j$,

$$P(\Phi \mid F_j) = P(\varphi_1 \mid F_j) \times \cdots \times P(\varphi_m \mid F_j). \tag{2}$$





We emphasize that this is a most stringent condition. For example,
it is equivalent to the assertion that:

Given that a $\Phi$ is in a certain F-class, if one is told also the values
of some of the $\varphi$'s, this gives absolutely no further information about
the values of the remaining $\varphi$'s.

Experimentally, one would expect to find independence when the
variations in the values of $\varphi$'s are due to “noise” or measurement
uncertainties within the individual $\varphi$-mechanisms:



For, to the extent that these have separate causes, one would not
expect the values of one to help predict the values of another. But
if the variations in the $\varphi$'s are due to selection of different $X$'s from
the same F-class, one would not ordinarily assume independence,
since the value of each $\varphi$ tells us something about which $X$ in F
has occurred, and hence should help at least partly to predict the
values of other $\varphi$'s:



Figure 12.9 

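An extreme example of nonindependence is the following: there
are two classes, $F_1$ and $F_2$, and two $\varphi$'s, defined by

$$\varphi_1(X) = \text{a pure random variable with } P(\varphi_1(X) = 1) = \tfrac{1}{2}$$

(its value is determined by tossing a coin, not by $X$), and

$$\varphi_2(X) = \begin{cases} \varphi_1(X) & \text{if } X \in F_1, \\ 1 - \varphi_1(X) & \text{if } X \in F_2. \end{cases}$$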









Then $P(\varphi_1 \wedge \varphi_2 \mid F_1) = \tfrac{1}{2}$.

But $P(\varphi_1 \mid F_1) \cdot P(\varphi_2 \mid F_1) = \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{1}{4}$.

Notice that neither $\varphi_1$ nor $\varphi_2$ taken alone gives any information
whatever about F! Each appears to be a random coin toss. But
from both one can determine perfectly which F has occurred,

for $\varphi_1 = \varphi_2$ implies $F_1$,

while $\varphi_1 \ne \varphi_2$ implies $F_2$

with absolute certainty.
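The failure of independence in this example can be verified by enumerating the two equally likely coin outcomes within each class. The sketch below uses exact fractions; the function names `joint_given_class` and `marginal` are hypothetical.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

def joint_given_class(cls):
    """Distribution of (phi1, phi2) given the class (1 or 2).

    phi1 is a fair coin; phi2 copies phi1 in class 1 and negates it in class 2.
    """
    dist = {}
    for phi1 in (0, 1):
        phi2 = phi1 if cls == 1 else 1 - phi1
        dist[(phi1, phi2)] = HALF
    return dist

def marginal(dist, index):
    """Probability that the predicate at position `index` equals 1."""
    return sum(prob for values, prob in dist.items() if values[index] == 1)

d1 = joint_given_class(1)
both_fire = d1.get((1, 1), Fraction(0))
product_of_marginals = marginal(d1, 0) * marginal(d1, 1)
print(both_fire, product_of_marginals)
```

Within class 1 the joint probability that both predicates fire is one half, while the product of the two marginal probabilities is one quarter, so Formula 2 fails.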

remark: We assume only independence within each class $F_j$. So if $X$ is
not given, then learning one $\varphi$-value can help predict another. For
example, suppose that

$\varphi_1 = \varphi_2 = 0$ if $X \in F_1$,

$\varphi_1 = \varphi_2 = 1$ if $X \in F_2$.

These two $\varphi$'s are in fact independent on each F. But if we did not know in
advance that $X \in F_1$ but were told that $\varphi_1 = 0$, we could indeed then
predict that $\varphi_2 = 0$ also, without this violating our independence
assumption. (If we had previously been told that $X \in F_1$, then we could
already predict the value of $\varphi_2$; in that case learning the value of $\varphi_1$
would have no effect on our prediction of $\varphi_2$.)

12.4.3 The Maximum Likelihood Decision, for Independent $\varphi$'s, Is a Linear Threshold Predicate!

Assume that the $\varphi_i$'s are statistically independent for each $F_j$.
Define

$$P_j = P(F_j),$$

$$p_{ij} = P(\varphi_i = 1 \mid F_j),$$

$$q_{ij} = 1 - p_{ij} = P(\varphi_i = 0 \mid F_j).$$

Suppose that we have just observed a $\Phi = (\varphi_1, \ldots, \varphi_m)$, and we
want to know which $F_j$ was most probably responsible for this.
Then, according to Formulas 1 and 2, we will choose that $j$ which
maximizes

$$P_j \cdot \prod_{\varphi_i = 1} p_{ij} \cdot \prod_{\varphi_i = 0} q_{ij}
 = P_j \cdot \prod_i p_{ij}^{\varphi_i}\, q_{ij}^{(1 - \varphi_i)}
 = P_j \cdot \prod_i \left(\frac{p_{ij}}{q_{ij}}\right)^{\varphi_i} \cdot \prod_i q_{ij}.$$

Because sums are more familiar to us than products, we will
replace these by their logarithms. Since $\log x$ increases when $x$
does, we still will select the largest of

$$\sum_i \varphi_i \cdot \log\frac{p_{ij}}{q_{ij}} + \left(\log P_j + \sum_i \log q_{ij}\right). \tag{3}$$

Because the right-hand expression is a constant that depends only
upon $j$, and not upon the experimental $\Phi$, all of Formula 3 can
be written simply as

$$\sum_i w_{ij}\varphi_i + \theta_j. \tag{3'}$$

Example 1: In the case that there are just two classes, $F_1$ and $F_2$,
we can decide that $X \in F_1$ whenever

$$\sum_i w_{i1}\varphi_i + \theta_1 > \sum_i w_{i2}\varphi_i + \theta_2,$$

that is, when

$$\sum_i (w_{i1} - w_{i2})\varphi_i > (\theta_2 - \theta_1), \tag{4}$$

which has the form of a linear threshold predicate

$$\psi = \left\lceil \sum_i \alpha_i \varphi_i > \theta \right\rceil.$$

Thus we have the remarkable result that the hypothesis of
independence among the $\varphi$'s leads directly to the familiar linear
decision policy.
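The passage from the product form to the linear form can be checked on a small case. The sketch below is illustrative only: the priors and conditional probabilities are invented numbers, and the function names are hypothetical. It computes $w_{ij} = \log(p_{ij}/q_{ij})$ and $\theta_j = \log P_j + \sum_i \log q_{ij}$ as in Formula 3, then compares the linear score with the direct product of Formulas 1 and 2.

```python
import math
from itertools import product

def linear_parameters(prior, p):
    """Weights w_i = log(p_i / q_i) and constant theta for a single class."""
    w = [math.log(pi / (1 - pi)) for pi in p]
    theta = math.log(prior) + sum(math.log(1 - pi) for pi in p)
    return w, theta

def linear_score(phi, w, theta):
    """Formula 3': sum_i w_i * phi_i + theta."""
    return sum(wi * x for wi, x in zip(w, phi)) + theta

def direct_likelihood(phi, prior, p):
    """Formulas 1 and 2: prior times the product of p_i or q_i terms."""
    value = prior
    for x, pi in zip(phi, p):
        value *= pi if x == 1 else 1 - pi
    return value

# Invented example: two classes, three independent predicates.
priors = [0.3, 0.7]
probs = [[0.9, 0.2, 0.6], [0.4, 0.5, 0.1]]
params = [linear_parameters(P, p) for P, p in zip(priors, probs)]

for phi in product((0, 1), repeat=3):
    by_product = max((0, 1), key=lambda j: direct_likelihood(phi, priors[j], probs[j]))
    by_linear = max((0, 1), key=lambda j: linear_score(phi, *params[j]))
    print(phi, by_product, by_linear)
```

The two rules agree on every $\Phi$ because the linear score is exactly the logarithm of the product.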

Example 2 (probabilities of error): Suppose that for all $i$, $p_{i1} = q_{i2}$.
Then $p_{i1}$ is the probability that $\varphi_i(X) = \psi(X)$ and $q_{i1}$ is the
probability that $\varphi_i(X) \ne \psi(X)$, that is, that $\varphi_i$ makes an error in
(individually) predicting the value of $\psi = \lceil X \in F_1 \rceil$.
Then Inequality 4 becomes

$$\sum_i w_{i1}(2\varphi_i - 1) > \log\frac{P_2}{P_1}. \tag{4'}$$

Now observe that the $(2\varphi_i - 1)$ term has the effect of adding or
subtracting $w_{i1}$ according to whether $\varphi_i = 1$ or 0. Thus, we can
think of the $w$'s as weights to be added (according to the $\varphi$'s) to
one side or the other of a balance:

The $\log(P_2/P_1)$ is the “a priori weight” in favor of $F_2$ at the
start, and each $w_{i1} = \log(p_{i1}/q_{i1})$ is the “weight of the evidence”
that $\varphi_i = 1$ gives in favor of $F_1$.

It is quite remarkable that the optimal separation algorithm (given that
the $\varphi$-probabilities are independent) has the form (Inequality 4) of a
linear threshold predicate. But one must be sure to understand that if
$\lceil \sum \alpha_i \varphi_i > \theta \rceil$ is the “optimal” predicate obtained by the
independent-probability method, yet does not perfectly realize a predicate
$\psi$, this does not preclude the existence of a precise separation
$\lceil \sum \alpha_i' \varphi_i > \theta' \rceil$ which always agrees with $\psi$. [This is the
situation suggested by Figure 12.3(a).] For Inequality 4 is “optimal” only
in relation to all $A_{\text{file}}$ procedures that use no information other than
the conditional probabilities $\{P_j\}$ and $\{p_{ij}\}$, while a perceptron
computes coefficients by a nonstatistical search procedure that is
sensitive to individual events.

Thus, if $\psi$ is in fact in $L(\Phi)$ the perceptron will eventually perform at
least as well as any linear-statistical machine. The latter family can have
the advantage in some cases:

1. If $\psi \notin L(\Phi)$ the statistical scheme may produce a good approximate
separation while the perceptron might fluctuate wildly.

2. The time to achieve a useful level may be long for the perceptron file
algorithm, which is basically a serial search procedure. The
linear-statistical machine is basically more parallel, because it finds
each coefficient independently of the others, and needs only a fair sample
of the F's. (While superficially perceptron coefficients appear to be
changed individually, each decision about a change depends upon a test
involving all the coefficients.)

12.4.4 Layer-Machines

Formula 3' suggests the design of a machine for making our
decision:

D is a device that simply decides which of its inputs is the largest.
Each $\varphi$-device emits a standard-sized pulse [if $\varphi(X) = 1$] when $X$
is presented. The pulses are multiplied by the $w_{ij}$ quantities as
indicated, and summed at the $\Sigma$-boxes. The $\theta_j$ terms may be
regarded as corrections for the extent to which the $p_{ij}$'s deviate from
a central value of $\tfrac{1}{2}$, combined with the a priori bias concerning
$F_j$ itself.

It is often desirable to minimize the costs of errors, rather than simply
the chance of error. If we define $C_{jk}$ to be the cost of guessing $F_k$ when it
is really $F_j$ that has occurred, then it is easy to show that Formulas 1 and
2 now lead to finding the $k$ that minimizes

$$\sum_j C_{jk}\, P_j\, B_j \prod_i \left(\frac{p_{ij}}{q_{ij}}\right)^{\varphi_i},$$

where $B_j = \prod_i q_{ij}$. It is interesting that this more complicated procedure
also lends itself to the multilayer structure:













It ought to be possible to devise a training algorithm to optimize the 
weights in this using, say, the magnitude of a reinforcement signal to 
communicate to the net the cost of an error. We have not investigated 
this. 

12.4.5 Probability-estimation Procedures 

The $A_{\text{file}}$ algorithm for the bayes linear-statistical procedure has
to compute, or estimate, either the probabilities $p_{ij}$ and $P_j$ of
Formula 3 or other statistical quantities such as “weight of evidence”
ratios $p/(1 - p)$. Normally these cannot be calculated
directly (because they are, by definition, limits) so one must find
estimators. The simplest way to estimate a probability is to
find the ratio $H/N$ of the number $H$ of “favorable” events to the
number $N$ of all events in a sequence. If $\varphi^{[t]}$ is the value of $\varphi$ on
the $t$th trial, then an estimate of $P(\varphi = 1)$ after $n$ trials can be
found by the procedure:

start:  Set $\alpha$ to 0.
        Set $n$ to 1.

repeat: Set $\alpha$ to $\dfrac{(n - 1)\alpha + \varphi^{[n]}}{n}$.
        Set $n$ to $n + 1$.
        Go to repeat.

which can easily be seen to recompute the “score” $H/N$ after
each event.
This procedure has the disadvantage that it has to keep a record
of $n$, the number of trials. Since $n$ increases beyond bound, this
would require unlimited memory. To avoid this, we rewrite the
above program’s computation in the form

$$\alpha^{[n]} = \left(1 - \frac{1}{n}\right)\alpha^{[n-1]} + \frac{1}{n}\,\varphi^{[n]}.$$

This suggests a simpler heuristic substitute: define

$$\alpha^{[0]} = 0, \qquad \alpha^{[n]} = (1 - \epsilon)\,\alpha^{[n-1]} + \epsilon\,\varphi^{[n]}, \tag{5}$$

where $\epsilon$ is a “small” number between 0 and 1. It is easy to show
that as $n$ increases the expected or mean value of $\alpha^{[n]}$, which we
will write as $\langle \alpha^{[n]} \rangle$, approaches $p$ (that is, $\langle \varphi \rangle$) as a limit. For

$$\langle \alpha^{[1]} \rangle = (1 - \epsilon)\langle \alpha^{[0]} \rangle + \epsilon p = \epsilon p = [1 - (1 - \epsilon)]\,p,$$

and

$$\langle \alpha^{[2]} \rangle = (1 - \epsilon)(1 - (1 - \epsilon))\,p + \epsilon p = (1 - (1 - \epsilon)^2)\,p,$$

and one can verify that, for all $n$,

$$\langle \alpha^{[n]} \rangle = (1 - (1 - \epsilon)^n)\,p \;\to\; p \quad (\text{as } n \to \infty).$$

Thus, Process 5 gives an estimation of the probability that $\varphi = 1$.

A more detailed analysis would show how the estimate depends
on the events of the recent past, with the effect of ancient events
decaying exponentially, with coefficients like $(1 - \epsilon)^{(n - t)}$.

Because Process 5 “forgets,” it certainly does not make “optimal”
use of its past experience; but under some circumstances it will be
able to “adapt” to changing environmental statistics, which could
be a good thing. As a direct consequence of the decay, our
estimator has a peculiar property: its variance “$\sigma^2$” does not
approach zero. In fact, one can show that, for Process 5,

$$\sigma^2 \;\to\; \frac{\epsilon}{2 - \epsilon} \cdot p(1 - p),$$

and this, while not zero, will be very small whenever $\epsilon$ is. The
situation is thus quite different from the $H/N$ estimate, whose
variance is $p(1 - p)/n$ and approaches zero as $n$ grows.

In fact, we can use the variance to compare the two procedures:
If we “equate” the variances

$$p(1 - p) \cdot \frac{\epsilon}{2 - \epsilon} = p(1 - p) \cdot \frac{1}{n}$$

we obtain

$$n = \frac{2 - \epsilon}{\epsilon} \approx \frac{2}{\epsilon},$$

suggesting that the reliability of the estimate of $p$ given by Process
5 is about the same as we would get by simple averaging of the
last $2/\epsilon$ samples; thus one can think of the number $1/\epsilon$ as
corresponding to a “time-constant” for forgetting.
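The mean and the limiting variance of Process 5 can be checked by iterating the exact recursions for the first two moments. This sketch assumes the trials are independent with $P(\varphi = 1) = p$, so that the variance obeys $V \leftarrow (1 - \epsilon)^2 V + \epsilon^2 p(1 - p)$; the function names are hypothetical.

```python
def expected_value(p, eps, n):
    """Mean of the estimate after n steps of Process 5, starting from 0."""
    mean = 0.0
    for _ in range(n):
        mean = (1 - eps) * mean + eps * p
    return mean

def variance(p, eps, n):
    """Variance of the estimate after n steps, assuming independent trials."""
    var = 0.0
    for _ in range(n):
        var = (1 - eps) ** 2 * var + eps ** 2 * p * (1 - p)
    return var

p, eps = 0.3, 0.1
print(expected_value(p, eps, 5), (1 - (1 - eps) ** 5) * p)
print(variance(p, eps, 2000), eps * p * (1 - p) / (2 - eps))
print((2 - eps) / eps)  # number of plain-average samples with the same variance
```

With $\epsilon = 0.1$ the matching sample size is 19, close to the $2/\epsilon = 20$ quoted in the text.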



Convergence to the Fixed-Point 


Another estimation procedure one might consider is:

start:  Set $\alpha$ to anything.

repeat: If $\varphi = 1$, set $\alpha$ to $\alpha + 1$.
        If $\varphi = 0$, set $\alpha$ to $(1 - \epsilon)\alpha$.
        Go to repeat.

or, equivalently, one could write

$$\alpha^{[n]} = (1 - \epsilon)\,\alpha^{[n-1]} + (1 + \epsilon\,\alpha^{[n-1]})\,\varphi^{[n]}.$$

It can be shown that this has an expected value, in the limit, of

$$\frac{1}{\epsilon} \cdot \frac{p}{1 - p}.$$

It is interesting that a direct estimate of the likelihood ratio is
obtained by such a simple process as “if $\varphi = 1$ add 1, otherwise
multiply by $(1 - \epsilon)$.” The variance, in case anyone cares, is

$$\frac{p}{(1 - p)^2} \cdot \frac{1}{1 - (1 - \epsilon)^2}.$$
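A short sketch can confirm that the two ways of writing the update agree and that the mean settles at a multiple of the likelihood ratio. The limiting value $p / (\epsilon(1 - p))$ used below is derived here from the fixed point of the mean recursion $m \leftarrow p(m + 1) + (1 - p)(1 - \epsilon)m$, under the assumption of independent trials; the function names are hypothetical.

```python
def update(alpha, phi, eps):
    """If phi = 1 add 1, otherwise multiply by (1 - eps)."""
    return alpha + 1 if phi == 1 else (1 - eps) * alpha

def update_one_line(alpha, phi, eps):
    """The same step written as a single formula."""
    return (1 - eps) * alpha + (1 + eps * alpha) * phi

def expected_after(p, eps, n, start=0.0):
    """Mean of alpha after n steps when P(phi = 1) = p on every trial."""
    mean = start
    for _ in range(n):
        mean = p * (mean + 1) + (1 - p) * (1 - eps) * mean
    return mean

p, eps = 0.25, 0.1
print(update(2.0, 1, 0.5), update_one_line(2.0, 1, 0.5))
print(expected_after(p, eps, 2000), p / (eps * (1 - p)))
```

The starting value does not matter in the limit, since each step of the mean recursion shrinks the distance to the fixed point by the factor $1 - \epsilon(1 - p)$.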

12.4.6 The Samuel Compromise 

In his classical paper about “Some Studies in Machine Learning
using the Game of Checkers,” Arthur L. Samuel uses an ingenious
combination of probability estimation methods. In his
application it occasionally happens that a new evidence term $\varphi_i$ is
introduced (and an old one is abandoned because it has not been
of much value in the decision process). When this happens there is
a problem of preventing violent fluctuations, because after one or
a few trials the term’s probability estimate will have a large
variance as compared with older terms that have better statistical
records. Samuel uses the following algorithm to “stabilize” his
system: he sets $\alpha^{[0]}$ to $\tfrac{1}{2}$ and uses

$$\alpha^{[n+1]} = \left(1 - \frac{1}{N}\right)\alpha^{[n]} + \frac{1}{N}\,\varphi^{[n]}$$

where $N$ is set according to the “schedule”:

$$N = \begin{cases} 16 & \text{if } n < 32, \\ 2^m & \text{if } 2^m \le n < 2^{m+1} \text{ and } 32 \le n < 256, \\ 256 & \text{if } 256 \le n. \end{cases}$$

Thus, in the beginning the estimate is made as though the probability
had already been estimated to be $\tfrac{1}{2}$ on the basis of several,
that is, the order of 16, trials. Then in the “middle” period, the
algorithm approximates the uniform weighting procedure. Finally
(when $n \ge 256$) the procedure changes to the exponential decay
mode, with fixed $N$, so that recent experience can outweigh earlier
results. (The use of the powers of two represents a convenient
computer-program technique for doing this.)
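The schedule can be sketched as follows. This is an illustration of the three regimes rather than Samuel's program: it assumes the update $\alpha \leftarrow (1 - 1/N)\alpha + \varphi/N$, trials counted from $n = 0$, and inclusive lower bounds in the schedule; the function names are hypothetical.

```python
def samuel_N(n):
    """Schedule for N: 16 at first, then the power of two at or below n, then 256."""
    if n < 32:
        return 16
    if n >= 256:
        return 256
    return 1 << (n.bit_length() - 1)  # 2**m with 2**m <= n < 2**(m + 1)

def samuel_estimate(observations):
    """Run the stabilized estimator over a list of 0/1 observations."""
    alpha = 0.5  # start as though 1/2 had already been estimated
    for n, phi in enumerate(observations):
        N = samuel_N(n)
        alpha = (1 - 1 / N) * alpha + phi / N
    return alpha

print([samuel_N(n) for n in (5, 32, 100, 255, 256, 1000)])
print(samuel_estimate([1]))
```

In the middle period $N$ stays within a factor of two of $n$, which is why the update resembles the uniform $H/N$ weighting there.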

In Samuel’s system, the terms actually used have the form we
found in Inequality 4' of §12.4.3,

$$2\varphi^{[t]} - 1,$$

so that the “estimator” ranges in the interval $-1 \le \alpha^{[t]} \le +1$
and can be treated as a “correlation coefficient.”

12.4.7 A Simple “Synaptic” Reinforcer Theory 

Let us make a simple “neuronal model.” The model is to estimate
$p_{ij} = P(\varphi_i \mid F_j)$, using only information about occurrences of
$[\varphi_i = 1]$ and of $[\Phi \in F_j]$. Our model will have the following
“anatomy”:

The bag $B_i$ contains a very high and constant concentration of a
substance E. When $\varphi_i$ or $F_j$ occur (or “fire”) the walls of the
corresponding bags $B_i$ and/or $C_j$ become “permeable” to E for a
moment. If $\varphi_i$ alone occurs, nothing really changes, because $B_i$ is
surrounded by the impermeable $C_j$. If $F_j$ alone occurs, $C_j$ loses
some E by diffusion to the outside: in fact, if $\alpha$ is the amount of
E in $C_j$ it may be assumed (by the usual laws of diffusion and
concentration) to lose some fraction $\epsilon$ of $\alpha$:

$$\alpha' = (1 - \epsilon)\alpha \quad \text{if } F_j \text{ occurs and } \varphi_i = 0.$$

If both $\varphi_i$ and $F_j$ occur then approximately the same loss will occur
from $C_j$. Simultaneously, an essentially constant amount $b$ will be
“injected” by diffusion from $B_i$ to $C_j$. So

$$\alpha' = (1 - \epsilon)\alpha + b \quad \text{if } F_j \text{ occurs and } \varphi_i = 1.$$

(We can assume that $b$ is constant because the concentration of E
is very high in $B_i$ compared to that in $C_j$. One can invent any
number of similar variations.) In either case we get

$$\alpha' = (1 - \epsilon)\alpha + \varphi_i b$$

so that in the limit the mean of $\alpha$ approaches $\dfrac{b}{\epsilon} \cdot p_{ij}$ (as can be
seen from the analysis in §12.4.5). This is proportional to, and
hence an estimator of, $p_{ij} = P(\varphi_i \mid F_j)$.

Thus the simple anatomy, combined with the membrane becoming
permeable briefly following a nerve impulse, could give a
quantity that is an estimator of the appropriate probability.


How could this representation of probability be translated into a
useful neuronal mechanism? One could imagine all sorts of
schemes: ionic concentrations (or rather, their logarithms!)
could become membrane potentials, or conductivities, or even
probabilities of occurrences of other chemical events. The “anatomy”
and “physiology” of our model could easily be modified
to obtain likelihood ratios. Indeed, it is so easy to imagine
variants (the idea is so insensitive to details) that we don’t propose
it seriously, except as a family of simple yet intriguing
models that a neural theorist should know about.


12.5 $A_{\text{file}}$ Algorithms for the isodata Procedure

In this section we describe a procedure proposed by G. Ball and
D. Hall to delineate “clusters” in an inhomogeneous distribution
of vectors. The idea is best shown by a pictorial example: imagine
a two-dimensional set of points $\{\Phi\}$ that fall into obvious clusters,
like

Begin by placing a few “cluster-points” $A_i^{[1]}$ into the space at some
arbitrary locations, say, near the center. We then divide the set
of $\Phi$'s into subsets $R_i^{[1]}$, assigning each $\Phi$ to the nearest $A_i^{[1]}$ point:

Next, we replace each $A_i^{[1]}$ by a new cluster-point $A_i^{[2]}$ which is the
mean or center-of-gravity of the $\Phi$'s in $R_i^{[1]}$, and then define $R_i^{[2]}$ to
be the set of $\Phi$'s nearest to $A_i^{[2]}$:

From now on, there is little or no change; the cluster-points have
“found the clusters.”

Ball and Hall give a number of heuristic refinements for creating and
destroying additional cluster points; for example, to add one if the
variance of an R-set is “too large” and to remove one if two are “too
close” together. Of course, one can usually “see” two-dimensional
clusters by inspection, but isodata is alleged to give useful results in
$n$-dimensional problems where “inspection” is out of the question.

To use this procedure, in our context, we need some way to combine
its automatic classification ability with the information about
the F-classes. An obvious first step would be to apply it separately
to each F-class, and assign all the resulting A's to that class. We
do not know much about more sophisticated schemes that might
lead to better results in the $A_{\text{find}}$ stage.

12.5.1 An isodata Convergence Theorem 

There is a theorem about isodata (told to us by T. Cover) that suggests
that it leads to some sort of local minimum. Let us formalize the
procedure by defining

$A^{[n]}(\Phi)$ = the $A_i^{[n]}$ that is nearest to $\Phi$.

(If there are several equidistant nearest $A_i$'s, choose the one with the
smallest index.)

$R_i^{[n]}$ = the set of $\Phi$'s for which $A^{[n]}(\Phi) = A_i^{[n]}$.

$A_i^{[n+1]} = \text{mean}\langle R_i^{[n]} \rangle$.

Finally define a “score”:

$$S^{[n]} = \sum_{\text{all } \Phi} \left|\Phi - A^{[n]}(\Phi)\right|^2.$$

Theorem: $S^{[1]} > S^{[2]} > \cdots > S^{[n]} > \cdots$
until there is no further change, that is, until $A_i^{[n]} = A_i^{[n+1]}$ for all $i$.

proof:

$$S^{[n]} = \sum_i \sum_{\Phi \in R_i^{[n]}} \left|\Phi - A_i^{[n]}\right|^2
 \ge \sum_i \sum_{\Phi \in R_i^{[n]}} \left|\Phi - A_i^{[n+1]}\right|^2$$

because the mean ($A_i^{[n+1]}$) of any set of vectors ($R_i^{[n]}$) is just the point
that minimizes the squared-distance sum to all the points (of $R_i^{[n]}$). And
this is, in turn,

$$\ge \sum_i \sum_{\Phi \in R_i^{[n+1]}} \left|\Phi - A_i^{[n+1]}\right|^2 = S^{[n+1]}$$

because each point will be transferred to an $R_i^{[n+1]}$ for which the distance
is minimal, that is, for all $j$,

$$\left|\Phi - A_j^{[n+1]}\right| \ge \left|\Phi - A^{[n+1]}(\Phi)\right|.$$

Corollary: The sequence of decreasing positive numbers, $S^{[n]}$, approaches
a limit. If there is only a finite set of $\Phi$'s, the A's must stop changing in a
finite number of steps.

For in the finite case there are only a finite number of partitions $\{R_i\}$
possible.
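The basic isodata step and its score can be sketched in a few lines. This is a bare illustration without the refinements of Ball and Hall: the function names and the sample points are invented, ties go to the smallest index as in the formal definition, and a cluster-point that attracts no points is simply left where it is (an assumption, since the text does not cover that case).

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def nearest_index(phi, centers):
    """Index of the nearest cluster-point; ties go to the smallest index."""
    return min(range(len(centers)), key=lambda i: sq_dist(phi, centers[i]))

def score(points, centers):
    """S: sum of squared distances from each point to its nearest cluster-point."""
    return sum(sq_dist(phi, centers[nearest_index(phi, centers)]) for phi in points)

def isodata_step(points, centers):
    """Assign each point to its nearest cluster-point, then move each to the mean."""
    regions = [[] for _ in centers]
    for phi in points:
        regions[nearest_index(phi, centers)].append(phi)
    new_centers = []
    for old, region in zip(centers, regions):
        if region:
            new_centers.append(tuple(sum(c) / len(region) for c in zip(*region)))
        else:
            new_centers.append(old)  # assumption: an empty region keeps its point
    return new_centers

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = [(4.0, 4.0), (6.0, 6.0)]
for _ in range(4):
    print(score(points, centers), centers)
    centers = isodata_step(points, centers)
```

On this sample the score never increases from one step to the next, and the cluster-points stop moving once each sits at the mean of its own region.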

12.5.2 Incremental Methods 

In analogy to the “reinforcement” methods in §12.4.5 we can
approximate isodata by the following program:

start:  Choose a set of starting points $A_i$.

repeat: Choose a $\Phi$.
        Find $A(\Phi)$, the $A_i$ nearest to $\Phi$.
        Replace $A(\Phi)$ by $(1 - \epsilon)A(\Phi) + \epsilon\Phi$.
        Go to repeat.

It is clear that this program will lead to qualitatively the same sort
of behavior; the A's will tend toward the means of their R-regions.
But, just as in §12.4, the process will retain a permanent
sampling-and-forgetting variance, with similar advantages and
disadvantages. In fact, all the $A_{\text{file}}$ algorithms we have seen can be so
approximated: there always seems to be a range from very local,
incremental methods, to more accurate, somewhat less “adaptive”
global schemes. We resume this discussion in §12.8.
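One pass of the incremental program can be sketched as a single function. The name `incremental_step` is hypothetical, the list of cluster-points is updated in place, and the choice of which $\Phi$ to present is left to the caller.

```python
def incremental_step(phi, centers, eps):
    """Move the cluster-point nearest to phi a fraction eps of the way toward it.

    `centers` is a list of tuples and is modified in place; the index of the
    cluster-point that was moved is returned.
    """
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    i = min(range(len(centers)), key=lambda k: sq_dist(phi, centers[k]))
    centers[i] = tuple((1 - eps) * a + eps * x for a, x in zip(centers[i], phi))
    return i

centers = [(0.0, 0.0), (10.0, 10.0)]
moved = incremental_step((1, 1), centers, eps=0.5)
print(moved, centers)
```

Only the nearest cluster-point moves, so each step costs one search over the A's and no stored data.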

12.6 Time vs. Memory for Exact Matching 

Suppose that we are given a body of information (we will call
it the data set) in the form of $2^a$ binary words each $b$ digits in
length (Figure 12.10); one can think of them as $2^a$ points chosen
at random from a space of $2^b$ points. (Take a million $\approx 2^{20}$ words
of length 100, for a practical example.) We will suppose that the
data set is to be chosen at random from all possible sets so that
one cannot expect to find much redundant structure within it.
Then the ordered data set requires about $b \cdot 2^a$ bits of binary
information for complete description. We won’t, however, be interested
in the order of the words in the data set. This reduces
the amount of information required to store the set to about
$(b - a) \cdot 2^a$ bits.
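The saving from ignoring order can be checked with Stirling's approximation. Assuming the $2^a$ words are distinct (our assumption; it holds almost surely when $2^a$ is far smaller than $2^{b/2}$), they can be listed in $(2^a)!$ orders, so the ordering itself carries about $\log_2 (2^a)!$ bits:

```latex
\log_2 (2^a)! \;\approx\; 2^a \log_2 2^a - 2^a \log_2 e
            \;=\; (a - \log_2 e)\, 2^a \;\approx\; a \cdot 2^a ,
\qquad
b \cdot 2^a - a \cdot 2^a \;=\; (b - a) \cdot 2^a .
```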

We want a machine that, when given a random $b$-digit word $w$,
will answer

Question 1. Is $w$ in the data set?*

and we want to formulate constraints upon how this machine 
works in such a way that we can separate computational aspects 
from memory aspects. The following scheme achieves this goal 
well enough to show, by examples, how little is known about the 
conjugacy between time and memory. 

We will give our machine a memory of $M$ separate bits, that is,
one-digit binary words. We are required to compose, in advance,
before we see the data set, two algorithms $A_{\text{file}}$ and $A_{\text{find}}$ that
satisfy the following conditions:

1. $A_{\text{file}}$ is given the data set. Using this as data, it fills the $M$ bits of
memory with information. Neither the data set nor $A_{\text{file}}$ is used again,
nor is $A_{\text{find}}$ allowed to get any information about what $A_{\text{file}}$ did, except by
inspecting the contents of $M$.


*We will get to Question 2 in about fifteen minutes. 






2. $A_{\text{find}}$ is then given a random word, $w$, and asked to answer Question 1,
using the information stored in the memory by $A_{\text{file}}$. We are interested
in how many bits $A_{\text{find}}$ has to consult in the process.

3. The goal is to optimize the design of $A_{\text{file}}$ and $A_{\text{find}}$ to minimize
the number of memory references in the question-answering computation,
averaged over all possible words $w$.

12.6.1 Case 1: Enormous Memory 

It is plausible that the larger be M, the smaller will be the average 
number of memory-references A find must make. Suppose that 

$M \ge 2^b$.

Let $m_j$ be the $j$th bit in memory; then there is a bit $m_w$ for each
possible query word $w$, and we can define

$A_{\text{file}}$: set $m_w$ to 1 if $w$ is in the data set.
$A_{\text{find}}$: $w$ is in the data set if $m_w = 1$.

Thus, with a huge enough memory, only one reference is required 
to answer Question 1. 

12.6.2 Case 2: Inadequate Memory 

Suppose that 

$M < (b - a)\,2^a$.

Here, the problem cannot be solved at all, since $A_{\text{file}}$ cannot store
enough information to describe the data set in sufficient detail.

12.6.3 Case 3: Binary Logarithmic Sort 

Suppose that 

$M = b \cdot 2^a$.

Now there is enough room to store the ordered data set. Define

$A_{\text{file}}$: store the words of the data set in ascending numerical
order.

$A_{\text{find}}$: perform a binary search to see first which half of memory
might contain $w$, then which quartile, etc.





This will require at most $a = \log_2 2^a$ inspections of $b$-bit words,
that is, $a \cdot b$ bit-inspections.

This is not an optimal search since (1) one does not always need to
inspect a whole word to decide which word to inspect next, and (2) it
does not exploit the uniformity of distribution that the first $a$ digits of
the ordered data set will (on the average) show. Effect 1 reduces the
required number from $a \cdot b$ to the order of $\tfrac{1}{2} a \cdot b$ and effect 2 reduces
it from $a \cdot b$ to $a \cdot (b - a)$. We don’t know exactly how these two
effects combine.
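The Case 3 pair can be sketched as follows. Words are represented as Python integers, and the sketch counts whole-word inspections rather than bits; the names and the parameters `a = 10`, `b = 32` are assumptions for illustration. With this particular loop the worst case is $a + 1$ word inspections rather than $a$, a difference that does not affect the estimate.

```python
import random

def file_sorted(data_set):
    # A_file: store the words in ascending numerical order.
    return sorted(data_set)

def find_binary(memory, w):
    # A_find: binary search; returns (found, number of words inspected).
    lo, hi = 0, len(memory)
    inspected = 0
    while lo < hi:
        mid = (lo + hi) // 2
        inspected += 1
        if memory[mid] == w:
            return True, inspected
        if memory[mid] < w:
            lo = mid + 1
        else:
            hi = mid
    return False, inspected

a, b = 10, 32
rng = random.Random(0)
data_set = rng.sample(range(2 ** b), 2 ** a)   # 2^a distinct b-bit words
memory = file_sorted(data_set)
```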

12.6.4 Case 4: Exhaustive Search 

Consider 

$M = (b - a)\,2^a$.

This gives just about enough memory to represent the unordered
data set. For example we could define

$A_{\text{file}}$: First put the words of the data set in numerical order.
Then compute their successive differences. These will require
about $(b - a)$ bits each. Use a standard method of
information theory, Huffman Coding (say), to represent
this sequence; it will require about $(b - a)\,2^a$ bits.

But the only retrieval schemes we can think of are like

$A_{\text{find}}$: Add up successive differences in memory until the sum
equals or exceeds $w$. If equality occurs, then $w$ is in the
data set.

And this requires $\approx \tfrac{1}{2}(b - a)\,2^a$ memory references, on the
average. It seems clear that, given this limit on memory, no
$A_{\text{file}}$-$A_{\text{find}}$ pair can do much better. That is, we suspect that

If no extra memory is available then to answer Question 1 one
must, on the average, search through half the memory.

One might go slightly further: even Huffman Coding needs some
extra memory, and if there is none, $A_{\text{file}}$ can only store an efficient
“number” for the whole data set. Then the conjecture is that $A_{\text{find}}$
must almost always look at almost all of memory.
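The difference scheme can be sketched as below. The sketch keeps the differences as plain integers and omits the Huffman coding step, so it illustrates only the sequential character of the retrieval; the function names and the tiny data set are our own.

```python
from itertools import accumulate

def file_differences(data_set):
    # A_file: sort, then keep the first word and the gaps between successive words.
    ordered = sorted(data_set)
    return [ordered[0]] + [y - x for x, y in zip(ordered, ordered[1:])]

def find_by_summing(differences, w):
    # A_find: add up differences until the running sum equals or exceeds w.
    # Returns (found, number of differences read).
    steps = 0
    for total in accumulate(differences):
        steps += 1
        if total >= w:
            return total == w, steps
    return False, steps

differences = file_differences({3, 10, 4, 25})
```

There is no way to jump into the middle of the list, since a difference means nothing without the sum of all those before it.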





12.6.5 Case 5: Hash Coding 

Consider 

$M = b \cdot 2^a \cdot 2$.

Here we have a case in which there is a substantial margin of
extra memory: about twice what is necessary for the data set.
The result here is really remarkable (one might even say
counterintuitive) because the mean number of references becomes very
small. The procedure uses a concept well known to programmers
who use it in “symbolic assembly programs” for symbol-table
references, but does not seem to be widely known to other
computer “specialists.” It is called hash coding.

There are many variants of this idea. We discuss a particular form 
adapted to a redundancy of two. 

In the hash-coding procedure, $A_{\text{file}}$ is equipped with a subprogram
$R(w, j)$ that, given an integer $j$ and a $b$-bit word $w$, produces an
$(a + 1)$-bit word. The function $R(w, j)$ is “pseudorandom” in the
sense that for each $j$, $R(w, j)$ maps the set of all $2^b$ input words
with uniform density on the $2^{a+1}$ possible output words and, for
different $j$'s, these mappings are reasonably independent or
orthogonal. One could use symmetric functions, modular
arithmetics, or any of the conventional pseudorandom methods.*

Now, we think of the $b \cdot 2^{a+1}$-bit memory as organized into $b$-bit
registers with $(a + 1)$-bit addresses. Suppose that $A_{\text{file}}$ has already
filed the words $w_1, \ldots, w_n$, and it is about to dispose of $w_{n+1}$.

$A_{\text{file}}$: Compute $R(w_{n+1}, 1)$. If the register with this address is
empty put $w_{n+1}$ in it. If that register is occupied do the
same with $R(w_{n+1}, 2)$, $R(w_{n+1}, 3)$, . . . until an unoccupied
register $R(w_{n+1}, j)$ is found; file $w_{n+1}$ therein.

$A_{\text{find}}$: Compute $R(w, 1)$. If this register contains $w$, then $w$ is in
the data set. If $R(w, 1)$ is empty, then $w$ is not in the data
set. If $R(w, 1)$ contains some other word not $w$, then do
the same with $R(w, 2)$, and if necessary $R(w, 3)$, $R(w, 4)$,
. . . , until either $w$ or an empty register is discovered.


*There is a superstition that $R(w, j)$ requires some magical property that can only
be approximated. It is true that any particular $R$ will be bad on particular data
sets, but there is no problem at all when we consider average behavior on all
possible data sets.

On the average, $A_{\text{file}}$ will make less than $2b$ memory-bit references!
To see this, we note first that, on the average, this procedure leads
to the inspection of just 2 registers! For half of the registers are
empty, and the successive values of $R(w, j)$ for $j = 1, 2, \ldots$ are
independent (with respect to the ensemble of all data sets) so the
mean number of registers inspected to find an empty register is 2.

Actually, the mean termination time is slightly less, because for $w$’s in
the data set the expected number of inspected registers is $< 2$. The
procedure is useful for symbol-tables and the like, where one may want
not only to know if $w$ is there, but also to retrieve some other data
associated (perhaps again by hash coding) with it.

When the margin of redundancy is narrowed, for example, if 

$$M = \frac{n}{n - 1} \cdot b \cdot 2^a ,$$

then only $(1/n)$th of the cells will be empty and one can expect to
have to inspect about $n$ registers.
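A sketch of the register-level procedure, for a redundancy of two, is given below. The address function `R` is a hypothetical stand-in built from a standard hash (any well-mixed function would serve), words are Python integers, and the parameters are arbitrary; the counts are of registers inspected, not bits.

```python
import hashlib
import random

def R(w, j, address_bits):
    # Hypothetical pseudorandom address function: hash the pair (w, j)
    # and keep address_bits bits of the result.
    digest = hashlib.blake2b(f"{w}:{j}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % (1 << address_bits)

def file_hash(data_set, a):
    # A_file: put each word in the first empty register among R(w,1), R(w,2), ...
    registers = [None] * (1 << (a + 1))
    for w in data_set:
        j = 1
        while registers[R(w, j, a + 1)] is not None:
            j += 1
        registers[R(w, j, a + 1)] = w
    return registers

def find_hash(registers, w, a):
    # A_find: probe R(w,1), R(w,2), ... until w or an empty register turns up.
    # Returns (found, number of registers inspected).
    j = 1
    while True:
        content = registers[R(w, j, a + 1)]
        if content == w:
            return True, j
        if content is None:
            return False, j
        j += 1

a = 10
rng = random.Random(0)
data_set = rng.sample(range(1 << 40), 1 << a)   # 2^a distinct 40-bit words
registers = file_hash(data_set, a)
```

Averaging the second component of `find_hash` over many words not in the data set should give a figure close to 2, and a smaller figure for words that are in it.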

Because people are accustomed to the fact that most computers 
are “word-oriented” and normally do inspect b bits at each 
memory cycle the following analysis has not (to our knowledge) 
been carried through to the case of 1-bit words. When we pro- 
gram A find to match words bit by bit we find that, since half 
the words in memory are zero, matching can be speeded up by 
assigning a special “zero” bit to each word. 

Assume, for the moment, that we have room for these $2^a$ extra
bits. Now suppose that a certain $w_0$ is not in the data set. (This
has probability $1 - 2^{a-b}$.) First inspect the “zero” bit associated
with $R(w_0, 1)$. This has probability $\tfrac{1}{2}$ of being zero. If it is not zero,
then we match the bits of $w_0$ with those of the word associated
with $R(w_0, 1)$. These cannot all match (since $w_0$ isn’t in the data
set) and in fact the mismatch will be found in (an average of)
$2 = 1 + \tfrac{1}{2} + \tfrac{1}{4} + \cdots$ references. Then the “zero” bit of $R(w_0, 2)$
must be inspected, and the process repeats. Each repeat has probability
$\tfrac{1}{2}$ and the process terminates when the “zero” bit of some
$R(w_0, j) = 0$. The expected number of references can be counted
then as

$$\tfrac{1}{2}\Big(1 + 2 + \tfrac{1}{2}\big(1 + 2 + \tfrac{1}{2}(\ldots)\big)\Big) + 1 = 3 + 1 = 4.$$

If $w_0$ is in the data set (an event whose probability is $2^{a-b}$) and we
repeat the analysis we get $4 + b$ references, because the process
must terminate by matching all $b$ bits of $w_0$.

The expected number of bit-references, overall, is then

$$4(1 - 2^{a-b}) + (4 + b)\,2^{a-b} = 4 + b \cdot 2^{a-b} \approx 4$$

since normally $2^{a-b}$ will be quite tiny. We consider it quite remarkable
that so little redundancy (a factor of two) yields this
small number!
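The figure 4 can also be reached by a one-line recurrence. Let $E$ be the expected number of bit-references for a word not in the data set, and assume (as in the analysis) that exactly half the registers are empty and that successive probes are independent. Each probe reads one “zero” bit; with probability $\tfrac{1}{2}$ the search stops there, and otherwise an average of 2 further bits expose the mismatch and the search starts afresh:

```latex
E = 1 + \tfrac{1}{2}\,(2 + E)
\quad\Longrightarrow\quad
\tfrac{1}{2} E = 2
\quad\Longrightarrow\quad
E = 4 .
```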

The estimates above are on the high side because in the case that $w_0$ is
in the data set the “run length” through $R(w_0, j)$ will be shorter, by
nearly a factor of 2, than chance would have it just because they were put
there by $A_{\text{file}}$. On the other hand, we must pay for the extra “zero”
bits we adjoined to $M$. If we have $M = 2b \cdot 2^a$ bits and make words
of length $b + 1$ instead of $b$, then the memory becomes slightly more
than half full: in fact, we must replace “4” in our result by something
like $4[(b + 1)/(b - 1)]$. Perhaps these two effects will offset one another;
we haven’t made exact calculations, mainly because we are not sure that
even this $A_{\text{file}}$-$A_{\text{find}}$ pair is optimal.

It certainly seems suspicious that half the words in memory are simply 
empty! On the other hand, the best one could expect from further im- 
proving the algorithms is to replace 4 by 3 (or 2?), and this is not a large 
enough carrot to work hard for. 

12.6.6 Summary of Exact Matching Algorithms 

To summarize our results on Question 1, we have established
upper bounds for the following cases. We believe that they are
close to lower bounds also but, especially in cases 3 and 4, are
not sure.


Case   Memory size         Bit-references                Method

2      $< (b - a)\,2^a$    $\infty$                      (impossible)
4      $(b - a)\,2^a$      $\tfrac{1}{2}(b - a)\,2^a$    (search all memory)
3      $b \cdot 2^a$       $\tfrac{1}{2}\,b \cdot a$     (logarithmic sort)
5      $2b \cdot 2^a$      $4 + \delta$                  (hash coding)
1      $\ge 2^b$           $1$                           (table look-up)





12.7 Time vs. Memory for Best Matching: An Open Problem 

We have summarized our (limited) understanding of “Question 1” 
— the exact matching problem — by the little table in §12.6.6. If 
one “plots the curve” one is instantly struck by the effectiveness 
of small degrees of redundancy. We do not believe that this 
should be taken too seriously, for we suspect that when the prob- 
lem is slightly changed the result may be quite different. We con- 
sider now 

Question 2: Given $w$, exhibit the word $\hat{w}$ closest to $w$ in the data
set.

The ground rules about $A_{\text{file}}$ and $A_{\text{find}}$ are the same, and distance
can be chosen to be the usual metric, that is, the number of digits
in which two words disagree. If $x_1, \ldots, x_b$ and $\hat{x}_1, \ldots, \hat{x}_b$ are the
(binary) coordinates of points $w$ and $\hat{w}$ then we define the
Hamming distance to be

$$d(w, \hat{w}) = \sum_{i=1}^{b} |x_i - \hat{x}_i| .$$

One gets exactly the same situation with the usual Cartesian
distance $C(w, \hat{w})$, because

$$[C(w, \hat{w})]^2 = \sum |x_i - \hat{x}_i|^2 = \sum |x_i - \hat{x}_i| = d(w, \hat{w})$$

so both $C(w, \hat{w})$ and $d(w, \hat{w})$ are minimized by the same $\hat{w}$.
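The identity is easy to confirm numerically. In the following sketch, words are tuples of 0's and 1's; the function names and the three-word data set are ours, and `best_match` is plain exhaustive search, used only to show that either distance selects the same word.

```python
def hamming(w, w_hat):
    # Number of digits in which the two words disagree.
    return sum(abs(x - y) for x, y in zip(w, w_hat))

def cartesian_squared(w, w_hat):
    # Square of the usual Cartesian distance.
    return sum((x - y) ** 2 for x, y in zip(w, w_hat))

def best_match(data_set, w, distance):
    # Exhaustive search; ties go to the earliest word in the data set.
    return min(data_set, key=lambda candidate: distance(w, candidate))

data_set = [(0, 0, 1, 1), (1, 1, 1, 0), (0, 1, 0, 1)]
w = (1, 1, 0, 0)
closest = best_match(data_set, w, hamming)
```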

12.7.1 Case 1: $M = 2^b \cdot b$.

$A_{\text{file}}$ assigns for every possible word $w$ a block of $b$ bits that contain
the appropriate bits of the correct $\hat{w}$.

$A_{\text{find}}$ looks in the block for $w$ and writes out $\hat{w}$. It uses $b$ references,
which seems about the smallest possible number.

12.7.2 Case 2: $M < (b - a)\,2^a$.

Impossible, for the same reason as in Question 1.

12.7.3 Case 3: $M = b \cdot 2^a$.

No result known.

12.7.4 Case 4: $M = (b - a)\,2^a$.

This presumably requires $(b - a) \cdot 2^a$ references, that is, all of
memory, for the same reason as in Question 1.





12.7.5 Case 5: $(b - a)\,2^a < M < b \cdot 2^b$.
No useful results known. 

12.7.6 Gloomy Prospects for Best Matching Algorithms 

The results in §12.6.6 showed that relatively small factors of re- 
dundancy in memory size yield very large increases in speed, for 
serial computations requiring the discovery of exact matches. 
Thus, there is no great advantage in using parallel computational 
mechanisms. In fact, as shown in §12.6.5, a memory-size factor of 
just 2 is enough to reduce the mean retrieval time to only slightly 
more than the best possible. 

But, when we turn to the best match problem, all this seems to
evaporate. In fact, we conjecture that even for the best possible
$A_{\text{file}}$-$A_{\text{find}}$ pairs, the speed-up value of large memory redundancies
is very small, and for large data sets with long word lengths there
are no practical alternatives to large searches that inspect large
parts of the memory.

We apologize for not having a more precise statement of the con- 
jecture, or good suggestions for how to prove it, for we feel that 
this is a fundamentally important point in the theory of computa- 
tion, especially in clarifying the distinction between serial and 
parallel concepts. 

Our belief in this conjecture is based in part on experience in find- 
ing fallacies in schemes proposed for constructing fast best- 
matching file and retrieval algorithms. To illustrate this we discuss 
next the proposal most commonly advanced by students. 

12.7.7 The Numerical-Order Scheme 

This proposal is a natural attempt to extend the method of Case 3
(§12.6.3) from exact match to best match. The scheme is

$A_{\text{file}}$: store the words of the data set in numerical order.

$A_{\text{find}}$: given a word $w$, find (by some procedure) those words whose
first $a$ bits agree most closely with the first $a$ bits of $w$. How
to do this isn’t clear, but it occurs to one that (since this is
the same problem on a smaller scale!) the procedure could
be recursively defined. Then see how well the other bits of
these words match with $w$. Next, ...(?)

The intuitive idea is simple: the word $\hat{w}$ in the data set that is
closest to $w$ ought to show better-than-chance agreement in the
first $a$ bits, so why not look first at words known to have this
property? There are two disastrous bugs in this program:

1. When can one stop searching? What should we fill in where we
wrote “Next, ...(?)”? We know no nontrivial rule that
guarantees getting the best match.

2. The intuitive concept, reasonable as it may appear, is not 
valid! It isn’t even of much use for finding a good match, let alone 
finding the best match. 

To elaborate on point 2, consider an example: let $a = 20$, $b = 10{,}000$.
Let $w$, for simplicity, be the all-zero word. A typical word in the data
set will have a mean of 5000 one's, and 5000 zero's. The standard
deviation will be $\tfrac{1}{2}(10{,}000)^{1/2} = 50$. Thus, less than one word in $2^a = 2^{20}$
can be expected to have fewer than 4750 one's. Hence, the closest word
in the data set will (on the average) have at least this many one's. That
closest word will have (on the average) $\ge 20 \cdot (4750/10{,}000) = 9.5$ one's
among its first 20 bits! The probability that $\hat{w}$ will indeed have very
few one's in its first 20 bits is therefore extremely low, and the slight
favorable bias obtained by inspecting those words first is quite utterly
negligible in reducing the amount of inspection. Besides, objection 1 still
remains.
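The arithmetic of this example can be checked directly. The sketch below uses the normal approximation to the binomial tail, which is our assumption; the variable names are ours as well.

```python
import math

a, b = 20, 10_000
mean = b / 2                                # expected number of one's in a random word
std = math.sqrt(b) / 2                      # sqrt(b * 1/2 * 1/2)
z = (4750 - mean) / std                     # 4750 one's, in standard deviations
tail = 0.5 * math.erfc(-z / math.sqrt(2))   # normal estimate of P(fewer than 4750 one's)
expected_words = tail * 2 ** a              # expected number of such words among 2^a
ones_in_prefix = a * 4750 / b               # share of 4750 one's falling in the first a bits
```

Under this approximation 4750 lies five standard deviations below the mean, the tail probability is smaller than $2^{-20}$, and so fewer than one word in the data set is expected to have so few one's.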

The value of ordering the first few bits of the words is quite useless,
then. Classifying words in this way amounts, in the $b$-dimensional
geometry, to breaking up the space into “cylinders” which are not well
shaped for finding nearby points. We have, therefore, tried various
arrangements of spheres, but the same sorts of trouble appear (after more
analysis). In the course of that analysis, we are led to suspect that there
is a fundamental property of $n$-dimensional geometry that puts a very
strong and discouraging limitation upon all such algorithms.

12.7.8 Why is Best Match so Different from Exact Match? 

If our unproved conjecture is true, one might want at least an 
intuitive explanation for the difference we would get between 
§12.6.3 and §12.7.3. One way to look at it is to emphasize that, 
though the phrases “best match” and “exact match” sound simi- 
lar to the ear, they really are very different. For in the case of 
exact match, no error is allowed, and this has the remarkable 
effect of changing an n-dimensional problem into a one-dimensional 
problem! For best matching we used the formula 

$$\text{Error} = \sum_{i=1}^{b} |x_i - \hat{x}_i| = \sum_{i=1}^{b} 1 \cdot |x_i - \hat{x}_i| ,$$





where we have inserted the coefficient “1” to show that all errors, 
in different dimensions, are counted equally. But for exact match, 
since no error can be tolerated, we don’t have to weight them 
equally: any positive weights will do! So for exact match we could 
just as well write 

$$\text{Error} = \sum_{i=1}^{b} 2^i\,|x_i - \hat{x}_i| \quad\text{or even}\quad \text{Error} = \sum_{i=1}^{b} 2^i\,(x_i - \hat{x}_i)$$

because either of these can be zero only when all $x_i = \hat{x}_i$. (Shades
of stratification.) But then we can (finally) rewrite the latter as

$$\text{Error} = \Big(\sum 2^i x_i\Big) - \Big(\sum 2^i \hat{x}_i\Big)$$

and we have mapped the $b$-dimensional vector $(x_1, \ldots, x_b)$ into a
single point on a one-dimensional line. Thus these superficially
similar problems have totally different mathematical personalities!
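A small sketch may make the collapse to one dimension concrete. Words are tuples of 0's and 1's indexed from 1, and the function names are ours. The signed, weighted error vanishes exactly when the two words are identical, and it is nothing more than the difference of two ordinary integers.

```python
def weighted_error(x, x_hat):
    # Error = sum_i 2^i (x_i - x_hat_i), with signed differences.
    return sum(
        (1 << i) * (xi - yi)
        for i, (xi, yi) in enumerate(zip(x, x_hat), start=1)
    )

def as_number(x):
    # The one-dimensional image of the word: sum_i 2^i x_i.
    return sum((1 << i) * xi for i, xi in enumerate(x, start=1))

x = (1, 0, 1, 1)
x_hat = (1, 1, 0, 1)
error = weighted_error(x, x_hat)   # equals as_number(x) - as_number(x_hat)
```

Exact match is therefore a comparison of numbers on a line, which is what makes sorting and binary search available for Question 1.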

12.8 Incremental Computation 

All the A file algorithms mentioned have the following curiously 
local property. They can be described roughly as computing the 
stored information M as a function of a large data set: 

$M = A_{\text{file}}(\text{data set})$

Now one can imagine algorithms which would use a vast amount 
of temporary storage (that is, much more than M or much more 
than is needed to store the data set) in order to compute M. Our 
$A_{\text{file}}$ algorithms do not. On the contrary, they do not even use
significantly more memory capacity than is needed to hold their 
final output, M. They are even content to examine just one mem- 
ber of the data set at a time, with no control over which they 
will see next, and without any subterfuge of storing the data in- 
ternally. 

It seems to us that this is an interesting property of computation 
that deserves to be studied in its own right. It is striking how 
many apparently “global” properties of a data set can be com- 
puted “incrementally” in this sense. Rather than give formal 
definitions of these terms, we shall illustrate them by simple 
examples. 





Suppose one wishes to compute the median of a set of a million 
distinct numbers which will be presented in a long, disordered 
list. The standard solution would be to build up in memory a 
copy of the entire set in numerical order. The median number 
can then be read off. This is not an incremental computation 
because the temporary memory capacity required is a million 
times as great as that required to store the final answer. More- 
over it is easy to see that there is no incremental procedure if 
the data is presented only once. 

The situation changes if the list is repeated as often as we wish. 
For then two registers are enough to find the smallest number on 
one pass through the list, the second smallest on a second pass, 
and so on. With an additional register big enough to count up to 
half the number N of items in the list, we can eventually find 
the median. 
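The repeated-pass procedure can be sketched as follows. The argument `make_pass` is a hypothetical function that replays the list from the start each time it is called, and the numbers are assumed distinct, as in the text. Besides the pass counter, only two registers are kept: the number found on the previous pass and the best candidate on the current one.

```python
def incremental_median(make_pass, n):
    # make_pass() replays the list; n is the number of items, assumed known.
    last = None                        # register 1: k-th smallest found so far
    for _ in range((n + 1) // 2):      # counter: up to about half of n
        best = None                    # register 2: smallest value above `last`
        for x in make_pass():
            if (last is None or x > last) and (best is None or x < best):
                best = x
        last = best
    return last

data = [5, 1, 9, 3, 7]
median = incremental_median(lambda: iter(data), len(data))
```

For an even number of items this sketch returns the lower of the two middle values. The price of the small memory is time: about $N/2$ passes over a list of $N$ items.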

It might seem at first sight, however, that an incremental compu- 
tation is precluded if the numbers are presented in a random se- 
quence, for example by being drawn, with replacement, from a 
well-stirred urn. But a little thought will show that an even more 
profligate expenditure of time will handle this case incrementally 
provided we can assume (for example) that we know the number 
of numbers in the set and are prepared to state in advance an 
acceptable probability of error. 

What functions of big “data sets” allow these drastic exchanges 
of time for storage space? Readers might find it amusing to con- 
sider that to compute the best plane (§12.2.3), given random 
presentation of samples, and bounds on the coefficients, requires 
only about three solution-sized memory spaces. One predicate we 
think cannot be computed without storage as large as the data 
set is: 

$\lceil$the numbers in the data set, concatenated in numerically
increasing order, form a prime number$\rceil$.

In case anyone suspects that all functions are incrementally com- 
putable (in some sense) let him consider functions involving 
decisions about whether concatenations of members of the data 
set are halting tapes for Turing machines. 



Perceptrons and Pattern Recognition 

13 


13.0 Introduction 

Many of the theorems show that perceptrons cannot recognize cer- 
tain kinds of patterns. Does this mean that it will be hard to build 
machines to recognize those patterns? 

No. All the patterns we have discussed can be handled by quite
simple algorithms for general-purpose computers. 

Does this mean, then, that the theorems are very narrow, applying
only to this small class of linear-separation machines?

Not at all. To draw that conclusion is to miss the whole point of 
how mathematics helps us understand things! Often, the main 
value of a theorem is in the discovery of a phenomenon itself 
rather than in finding the precise conditions under which it occurs. 

Everyone knows, for example, about the “Fourier Series phe- 
nomenon” in which linear series expansions over a restricted class 
of functions (the sines and cosines) are used to represent all the 
functions of a much larger class. But only a very few of us can 
recall the precise conditions of a theorem about this! The impor- 
tant knowledge we retain is heuristic rather than formal — that a 
fruitful approach to certain kinds of problems is to seek an ap- 
propriate base for series expansions. 

That sounds very sensible. But how might it apply to the theorems 
about perceptrons? 

For example, the stratification theorem shows that certain predi- 
cates have lower order than one’s geometric intuition would sug- 
gest; one can encode information in a “nonstandard” way by 
using very large coefficients. The conditions given for Theorem 
7.2 are somewhat arbitrary, and many predicates can be realized 
in similar ways without meeting exactly these conditions. The 
theorem itself is just a vehicle for thoroughly understanding 
an instance of the more general encoding phenomenon. 

Does it apply also to the negative results? 

Yes, although it is harder to tell when “phenomena of limita- 
tions” will extend to more general machine-schemas. After we 
circulated early drafts of this book, we heard that some percep- 
tron advocates made statements like “Their conclusions hold only 





if their conditions are exactly satisfied; our machines are not 
exactly the same as theirs.” But consider, for example, the kind 
of limitation demonstrated by the And/Or theorem. Although 
the limitation as stated could be circumvented by adding another 
layer of logic to the machine scheme to permit “and”-ing two 
perceptrons together, this would certainly miss the point of the 
phenomenon. To be sure, the new machine will realize some 
predicates that the simpler machines could not. But if the and/or 
phenomenon is understood, then the student will quickly ask: Is 
the new machine itself subject to a similar closure limitation? 
We expect that no moderate extension of the machine-schema 
in such a direction would really make much difference to its 
ability to handle context-dependence. 

We believe (but cannot prove) that the deeper limitations extend 
also to the variant of the perceptron proposed by A. Gamba. We 
discuss this in the next section. 

13.1 Gamba Perceptrons and other Multilayer Linear Machines 

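In a series of papers (1960, 1961), A. Gamba and his associates
describe experiments with a type of perceptron in which each $\varphi$ is
itself a thresholded measure, that is, a perceptron of order 1:

$$\varphi_i = \Big\lceil \sum_j \beta_{ij} x_j > \theta_i \Big\rceil ,
\qquad
\psi = \Big\lceil \sum_i \alpha_i \Big\lceil \sum_j \beta_{ij} x_j > \theta_i \Big\rceil > \theta \Big\rceil .$$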

This scheme lends itself to physically realizable linear devices. For
example, each $\varphi$ could be realized by an optical filter and threshold
photodetector (Figure 13.1).

Filters have been proposed that range from completely random
patterns to carefully selected “feature detectors,” moment integrals,
and templates. One can even obtain complex values for the
$\beta_{ij}$'s by using paired masks, or phase-coherent optics. We would
like to have a good theory of these machines, especially because
very large arrays can be obtained so economically by optical and
similar techniques.






Unfortunately, we do not know how to adapt the algebraic 
methods that worked on order-limited perceptrons, nor have we 
found any other analytic techniques. We can thus make only a 
few observations and ask questions. Note first that if the inner 
threshold operations are eliminated, we have simply an order-1 
perceptron: 


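$$\psi = \Big\lceil \sum_i \alpha_i \sum_j \beta_{ij} x_j > \theta \Big\rceil
      = \Big\lceil \sum_j \alpha'_j x_j > \theta \Big\rceil .$$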
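The combined coefficients follow from exchanging the order of the two finite sums. In the notation of this section (the name $\alpha'_j$ for the merged weight is the only symbol involved):

```latex
\sum_i \alpha_i \sum_j \beta_{ij}\, x_j
  \;=\; \sum_j \Big( \sum_i \alpha_i \beta_{ij} \Big) x_j
  \;=\; \sum_j \alpha'_j\, x_j ,
\qquad
\alpha'_j \;=\; \sum_i \alpha_i \beta_{ij} .
```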


That the nonlinear operations have a real effect is shown by the
fact that a simple Gamba machine can recognize the predicate

$$\psi_{ABC}(X) = \big\lceil (|X \cap A| > |X \cap B|) \lor (|X \cap C| > |X \cap B|) \big\rceil$$

that was shown in Chapter 4 to be not of finite order. For if we
define

$$\beta_{1j} = \begin{cases} 1 & \text{if } x_j \in A \\ -1 & \text{if } x_j \in B \\ 0 & \text{otherwise} \end{cases}
\qquad
\beta_{2j} = \begin{cases} 1 & \text{if } x_j \in C \\ -1 & \text{if } x_j \in B \\ 0 & \text{otherwise} \end{cases}$$

then it follows that $\psi_{ABC} = \lceil \varphi_1 + \varphi_2 > 0 \rceil$ is recognizable by a
Gamba machine.
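This two-mask machine is small enough to write out. In the sketch below a figure $X$ is a Python set of retina points, and `A`, `B`, `C` are disjoint sets of points; the nine-point retina and the function name are assumptions for illustration. Both inner thresholds and the outer threshold are taken to be 0.

```python
def gamba_abc(x, A, B, C):
    # Inner layer: two thresholded order-1 measures with weights +1, -1, 0.
    def beta1(p):
        return 1 if p in A else -1 if p in B else 0

    def beta2(p):
        return 1 if p in C else -1 if p in B else 0

    phi1 = int(sum(beta1(p) for p in x) > 0)   # [ |X n A| > |X n B| ]
    phi2 = int(sum(beta2(p) for p in x) > 0)   # [ |X n C| > |X n B| ]
    # Outer layer: a linear threshold on the two mask outputs.
    return phi1 + phi2 > 0

A, B, C = {0, 1, 2}, {3, 4, 5}, {6, 7, 8}
example = gamba_abc({0, 1, 3}, A, B, C)   # two points in A against one in B
```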

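Another predicate of unbounded order, $\psi_{\text{PARITY}}$, can be realized by
simple Gamba machines: in fact any predicate $\psi(X) = \psi(|X|)$
that depends only on $|X|$ can be realized as follows: define

$$\varphi_{(n)}(X) = \lceil |X| \ge n \rceil = \Big\lceil \sum x_i \ge n \Big\rceil$$

and define

$$\alpha_0 = \psi(0), \qquad
\alpha_1 = \psi(1) - \alpha_0, \qquad \ldots, \qquad
\alpha_{n+1} = \psi(n + 1) - \sum_{i=0}^{n} \alpha_i .$$

Then we can write

$$\psi(X) = \Big\lceil \sum_{i=0}^{|R|} \alpha_i\, \varphi_{(i)}(X) > 0 \Big\rceil .$$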
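The construction can be tried out for parity on a small retina. In the sketch below a figure is a tuple of 0's and 1's, `psi` is given as a function of $|X|$ with values 0 or 1, and the mask $\varphi_{(n)}$ is taken to fire when $|X| \ge n$; the function names and the retina size of 6 are ours.

```python
from itertools import product

def counter_coefficients(psi, size):
    # alpha_0 = psi(0); alpha_{n+1} = psi(n + 1) - (alpha_0 + ... + alpha_n)
    alphas = [psi(0)]
    for n in range(size):
        alphas.append(psi(n + 1) - sum(alphas))
    return alphas

def gamba_counting(x, alphas):
    # phi_(n)(X) = [ |X| >= n ];  psi(X) = [ sum_n alpha_n * phi_(n)(X) > 0 ]
    total = sum(x)
    return sum(alpha * int(total >= n) for n, alpha in enumerate(alphas)) > 0

def parity(n):
    return n % 2

alphas = counter_coefficients(parity, 6)
figures = list(product((0, 1), repeat=6))
outputs = [gamba_counting(x, alphas) for x in figures]
```

The partial sums of the coefficients telescope to $\psi(|X|)$, which is why a single outer threshold suffices.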



If we place no limitation on the number of Gamba-masks, then
the machines can recognize any pattern, since we can define, for
each figure $F$, a template that recognizes exactly that figure:

$$\varphi_F = \Big\lceil \sum_{x_i \in F} x_i - \sum_{x_i \notin F} x_i \ge |F| \Big\rceil .$$

Then any class $\mathbf{F}$ of figures is recognized by

$$\psi_{\mathbf{F}} = \Big\lceil \sum_{F \in \mathbf{F}} \varphi_F > 0 \Big\rceil .$$


This is not an interesting result, for it says only that any class has
a disjunctive Boolean form, and can be recognized if one allows
as many Gamba masks as there are figures in the class. But it is
interesting that the area-dependent classes like $\psi_{\text{PARITY}}$ and $\psi_{ABC}$
require at most $|R|$ masks, as shown above. It is not hard to
prove that $\psi_{\text{PARITY}}$ requires at least $\log |R|$ Gamba masks, but it
would be nice to have a sharper result. We are quite sure that for
predicates involving more sophisticated relations between subparts
of figures the Gamba machines, like the order-limited machines,
are relatively helpless, but the precise statement of this
conjecture eludes us. For instance, we think that $\psi_{\text{CONNECTED}}$ would
require an enormous number of Gamba-masks, perhaps nearly
as many as the class of connected figures. Such conjectures, stated
in terms of the number of $\varphi$'s in the machine, seem harder to
deal with than the simpler categorical impossibility of recognition
with any number of masks of bounded order.

13.2 Other Multilayer Machines 

Have you considered “perceptrons” with many layers?

Well, we have considered Gamba machines, which could be 
described as “two layers of perceptron.” We have not found (by 
thinking or by studying the literature) any other really interesting 
class of multilayered machine, at least none whose principles seem 
to have a significant relation to those of the perceptron. To see 
the force of this qualification it is worth pondering the fact, 
trivial in itself, that a universal computer could be built entirely 
out of linear threshold modules. This does not in any sense reduce 
the theory of computation and programming to the theory of per- 
ceptrons. Some philosophers might like to express the relevant 
general principle by saying that the computer is so much more 
than the sum of its parts that the computer scientist can afford 
to ignore the nature of the components and consider only their 
connectivity. More concretely, we would call the student’s atten- 
tion to the following considerations: 

1. Multilayer machines with loops clearly open all the questions 
of the general theory of automata. 

2. A system with no loops but with an order restriction at each 
layer can compute only predicates of finite order. 

3. On the other hand, if there is no restriction except for the 
absence of loops, the monster of vacuous generality once more 
raises its head. 

The problem of extension is not merely technical. It is also 
strategic. The perceptron has shown itself worthy of study 
despite (and even because of!) its severe limitations. It has many 
features to attract attention: its linearity; its intriguing learning 





theorem; its clear paradigmatic simplicity as a kind of parallel 
computation. There is no reason to suppose that any of these 
virtues carry over to the many-layered version. Nevertheless, we 
consider it to be an important research problem to elucidate (or 
reject) our intuitive judgment that the extension is sterile. Per- 
haps some powerful convergence theorem will be discovered, or 
some profound reason for the failure to produce an interesting 
“learning theorem” for the multilayered machine will be found. 

13.3 Analyzing Real-World Scenes 

One can understand why you, as mathematicians, would be interested
in such clear and simple predicates as $\psi_{\text{PARITY}}$ and
$\psi_{\text{CONNECTED}}$. But what if one wants to build machines to recognize
chairs and tables or people? Do your abstract predicates have
any relevance to such problems, and does the theory of the simple
perceptron have any relevance to the more complex machines one
would use in practice?

This is a little like asking whether the theory of linear circuits 
has relevance to the design of television sets. Absolutely, some 
concept of connectedness is required for analyzing a scene with 
many objects in it. For the whole is just the sum of its parts and 
their interrelations, and one needs some analysis related to con- 
nectedness to divide the scene into the kinds of parts that corre- 
spond to physically continuous objects. 

Then must we conclude from the negative results of Chapter 5 that 
it will be very hard to make machines to analyze scenes? 

Only if you confine yourself to perceptrons. The results of Chap- 
ter 9 show that connectivity is not particularly hard for more 
serial machines. 

But even granting machines that handle connectivity, isn’t there an
enormous gap from that to machines capable of finding the objects 
in a picture of a three-dimensional scene? 

The gap is not so large as one might think. To explain this, we 
will describe some details of a kind of computer program that can 
do it. The methods we will describe fall into the area that is today 
called “heuristic programming” or “artificial intelligence.”* 

*See, for example, Computers and Thought (Feigenbaum and Feldman, 1963) and
Semantic Information Processing (Minsky, 1968). 





Consider the problem of designing a machine that, when pre- 
sented with a photograph, will be capable of describing the fol- 
lowing scene. 



One would want the machine to say, at least, that this scene shows 
four objects — three of them rectangular and the fourth cylindrical 
— and to say something about their relative positions. 

The tradition of heuristic programming would suggest providing 
the machine with a number of distinct abilities such as the following:

1. The ability to detect points at which the light conditions 
change rapidly enough to suggest an edge. 


2. The ability to “cluster” the set of proposed edge-points into 
subsets that may each be taken as a hypothetical line segment or 
curve. 




3. The ability to pass from the “line drawing” to a list of con- 
nected regions or “faces.” 



4. The ability to cluster faces into objects. In §13.4, we will de- 
scribe such a method, developed by one of our students, Adolfo 
Guzman. This procedure is remarkably insensitive to partial 
covering of one object by another. 






5. The ability to recognize certain features, such as shadows, as 
artifacts. 

Perhaps most important is 

6. The ability to make each of the above decisions on a tentative 
“working” basis and retract them if something “implausible” 
happens in any phase of the procedure. For example, if a region 
in Step 3 turns out to have an unusually complicated shape (rela- 
tive to the class of objects for which it is designed) the existence 
of some of the lines proposed in Step 2 might be challenged, or 
others might be proposed, to be verified by re-activating Step 1 
with a lower threshold. 

7. All these processes might be organized by a supervisory pro- 
gram like the “General Problem Solver” of Newell, Shaw, and 
Simon (1959) or the executive program of a large programming 
system. 
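The control structure of ability 6 can be sketched in miniature. The following toy works on a one-dimensional scan line rather than a photograph, and every name in it (`detect_edges`, `segment`, `analyze`, `max_region_len`) is hypothetical; it is meant only to show a decision being held tentatively and an earlier step being re-activated with a lower threshold when a later step finds the result implausible.

```python
def detect_edges(intensity, threshold):
    """Toy Step 1: positions where adjacent samples differ by at least `threshold`."""
    return [i for i in range(1, len(intensity))
            if abs(intensity[i] - intensity[i - 1]) >= threshold]

def segment(intensity, threshold):
    """Toy Step 3: cut the scan line into regions at the proposed edges."""
    cuts = [0] + detect_edges(intensity, threshold) + [len(intensity)]
    return list(zip(cuts, cuts[1:]))

def analyze(intensity, max_region_len, threshold=8, min_threshold=1):
    """Treat each segmentation as tentative: if some region is implausibly
    long, retract it and re-run edge detection with a lower threshold."""
    while True:
        regions = segment(intensity, threshold)
        plausible = all(end - start <= max_region_len for start, end in regions)
        if plausible or threshold <= min_threshold:
            return regions, threshold
        threshold -= 1

# A faint edge at position 3 is missed at the initial threshold and
# recovered only after the over-long region is challenged.
regions, final_threshold = analyze([10, 10, 10, 13, 13, 13, 30, 30, 30],
                                   max_region_len=3)
```

A real system would of course have several interacting steps and several kinds of plausibility test; the point is only that the flow of control is conditional and sequential.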

A system with such a set of abilities is in a very different class from 
a perceptron, if only because of the variety of operations it performs and 
forms of knowledge it uses. People often suggest that the methods of 
artificial intelligence and those associated with the perceptron are not 
as opposed as we think. For example, they say a perceptronlike algorithm 
might be used at each “level” to make the separate kinds of distinction.
But using a perceptron as a component of a highly structured system
entirely degrades its pretension to be a “self-organizing” system. If one
is going to design such a system, one might as well be pragmatic in 
choosing an appropriate algorithm at each stage. 

The spirit of the approach we have in mind is illustrated by the 
role in the following example of sequential operations, hy- 
potheses, and hierarchical descriptions. 

13.4 Guzman’s Approach to Scene-Analysis 

In scenes like this,

where all the objects are rectangular solids, and do not occlude
one another badly, we can discover the objects by the extremely 
local process of locating all the “Y-joints.” Each object contains 
at most one such distinctive feature. This could, of course, fail 
because of perspective, as in 


which could be a cube, or in 



(for we require each of the three angles of a Y-joint to span
less than 180 degrees). A more serious failure is in the case of 
occlusion, as in 



where one of the Y-joints is completely hidden from view. But the
great power of programs capable of hierarchical decisions is illustrated
by the possibility of first recognizing the small cube, then
removing it, next extending the hidden lines, and so discovering
the large cube!



The program developed by Adolfo Guzman proceeds in a rather 
different way; his idea is to treat different kinds of local configurations
as providing different degrees of evidence for “linking” the
faces that meet there. 

For example, in these three types of vertex configurations 




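[Figure: the “Y,” “arrow,” and “T” vertex configurations, with the faces meeting at each vertex numbered in Roman numerals.]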


the “Y” provides evidence for linking I to II, II to III, and I to 
III. The “arrow” just links I to II. Because a “T” is ordinarily 
the result of one object occluding part of another, it is regarded 
as evidence against linking I to III or II to III (and it is neutral 
about I and II). Using just these rules, we can convert pictures 
into associated groups of faces as follows: we represent Y links by 
straight lines and arrow links by curves. 
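The three rules just stated are simple enough to write down directly. The sketch below is an illustration only, not Guzman's program: the function name `vertex_evidence`, the representation of a vertex as a `(kind, (I, II, III))` tuple, and the use of two separate tallies for positive and negative evidence are all assumptions.

```python
from collections import Counter

def vertex_evidence(vertices):
    """Tally link evidence from a list of (kind, (f1, f2, f3)) vertices,
    where f1, f2, f3 are the faces labeled I, II, III at that vertex.

    Returns two Counters keyed by unordered face pairs:
    evidence for linking, and evidence against linking.
    """
    links = Counter()
    against = Counter()
    for kind, (f1, f2, f3) in vertices:
        if kind == "Y":
            # A Y links all three pairs of faces.
            for a, b in ((f1, f2), (f2, f3), (f1, f3)):
                links[frozenset((a, b))] += 1
        elif kind == "arrow":
            # An arrow links only I to II.
            links[frozenset((f1, f2))] += 1
        elif kind == "T":
            # A T counts against I-III and II-III and is neutral about I-II.
            for a, b in ((f1, f3), (f2, f3)):
                against[frozenset((a, b))] += 1
        else:
            raise ValueError(f"unknown vertex type: {kind!r}")
    return links, against

links, against = vertex_evidence([
    ("Y", ("A", "B", "C")),
    ("arrow", ("A", "B", "D")),
    ("T", ("A", "B", "E")),
])
```

In this small example the pair A, B collects two links (one from the Y, one from the arrow), while E collects only negative evidence.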







So far, there has been no difficulty in associating linked faces 
with objects. The usefulness of the variety of kinds of evidence 
shows up only in more complicated cases: In this figure 



we find some “false” links due to the merging of vertices on dif- 
ferent objects. To break such false connections, the program 
uses a hierarchical scheme that first finds subsets of faces that 
are very tightly linked (e.g., by two or more links). These “nuclei” 
then compete for more weakly linked faces. There is no competi- 
tion in Examples 1-4, but in 5 the single false links between the 
cubes are broken by his procedure. In Example 6 the “false” links 
are broken also. If a very simple “competition” algorithm were 
not adequate here, one could also take into account the negative 
evidence the two T-joints provide against linking I to III and II to III.
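A very simple “competition” of this kind might be sketched as follows. This is a deliberately reduced reading of the scheme described above, not Guzman's procedure: the name `group_faces`, the `strong` parameter, the input format (a mapping from unordered face pairs to link counts), and the tie-breaking rule are assumptions, and the negative evidence from T-joints is ignored.

```python
from collections import Counter, defaultdict

def group_faces(links, strong=2):
    """Group faces into objects from a dict {frozenset({a, b}): link_count}."""
    faces = sorted({face for pair in links for face in pair})
    parent = {face: face for face in faces}

    def find(face):
        # Union-find lookup with path halving.
        while parent[face] != face:
            parent[face] = parent[parent[face]]
            face = parent[face]
        return face

    # Phase 1: nuclei are sets of faces joined by `strong` or more links.
    for pair, count in links.items():
        if count >= strong:
            a, b = sorted(pair)
            parent[find(a)] = find(b)

    members = defaultdict(set)
    for face in faces:
        members[find(face)].add(face)
    nuclei = [group for group in members.values() if len(group) > 1]
    loose = [face for face in faces if len(members[find(face)]) == 1]

    # Phase 2: nuclei compete for each weakly linked face; a face joins the
    # nucleus holding the most links to it, or stands alone on a tie.
    groups = [set(nucleus) for nucleus in nuclei]
    for face in loose:
        score = Counter()
        for pair, count in links.items():
            if face in pair:
                (other,) = pair - {face}
                for i, nucleus in enumerate(nuclei):
                    if other in nucleus:
                        score[i] += count
        ranked = score.most_common(2)
        if ranked and (len(ranked) == 1 or ranked[0][1] > ranked[1][1]):
            groups[ranked[0][0]].add(face)
        else:
            groups.append({face})
    return sorted(sorted(group) for group in groups)

def link(a, b):
    return frozenset((a, b))

# Two cubes with doubly linked faces, one single "false" link between
# them (A3-B1), and one weakly linked face C.
two_cubes = {
    link("A1", "A2"): 2, link("A2", "A3"): 2, link("A1", "A3"): 2,
    link("B1", "B2"): 2, link("B2", "B3"): 2, link("B1", "B3"): 2,
    link("A3", "B1"): 1, link("C", "A1"): 1,
}
objects = group_faces(two_cubes)
```

Because the single link between A3 and B1 joins faces that already belong to different nuclei, it is simply never used, which is the sense in which the false link is “broken.”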




We have described only the skeleton of his scheme; Guzman uses 
several other kinds of links, including evidence from collinear 
T-joint lines of the form 





and the effects of some vertices are modified by their associations 
with others. This variety of resources enables the program to 
solve scenes like 




in which some object faces are completely disconnected. 

The method of combining evidence used by this procedure might 
suggest a similarity to the perceptron’s weighting of evidence. 
However, the local character of this similarity most strongly 
marks the deep difference between the approaches. For Guzman’s 
algorithm uses this as but a small part of a procedure to evaluate 
links between abstract entities called faces which in turn have 
been proposed by another program that uses processes related to 
connectedness. 


In “locally ambiguous” figures, something more akin to reason- 
ing and problem solving is required. For example, the stack of 
cubes in Figure 13.2 can be falsely structured by the human visual 
process if one looks only at the center. This structure cannot be 
extended to the whole figure, suggesting that we, too, use a procedure
that in case of failure can “back up” to a different hypothesis.

Figure 13.2

It is outside the scope of this book to discuss heuristic program- 
ming in further detail. Readers who wish for more information 
should consult the references. [Handwritten addition, only partly legible: the
subject of scene analysis has advanced, and these references are dated.]

13.5 Why Prove Theorems? 

Why did you prove all these complicated theorems? Couldn’t you
just take a perceptron and see if it can recognize ψ_CONNECTED?

No. 

13.6 Sources and Development of Ideas 

Our debts to previous work and to other people, places, and 
institutions can best be described by a brief historical sketch of 
our work. Our collaboration began in 1963 when we were brought 
together by Warren S. McCulloch. We have a special debt to him 
not only because of this but because he was the first to think 
seriously about the problems we have studied. 

13.6.1 The Group-Invariance Theorem 

We had both been interested in the perceptron since its announce- 
ment by Rosenblatt in 1957. In fact we had both presented papers 
related to its “learning” aspect at a symposium on information 
theory in London in 1960. Our serious attack on its geometric 
problems started in the spring of 1965. At that time, it was 
generally known that order-1 perceptrons could not compute
translation-invariant functions other than functions of |X|, but
there was no hint as to how this might generalize. 

In retrospect the most obvious obstacle was the lack of a concept 
of order. Earlier studies on the power of perceptrons were based 
on Φ-sets of partial functions defined by stochastic generative
processes or subject to irrelevant conditions such as that they
themselves be linear threshold functions of a small number of
variables. Such limitations (as opposed to our |S(φ)| < k) seemed
always to produce mathematically intractable situations and so 
reinforced the dominant tendency to approach the problem as 
one of statistics rather than algebra. In making the shift, we feel 
most closely anticipated by Bledsoe and Browning (1959) who
considered a pure order limitation on a type of conjunctively
local machine. 

With the concept of order in mind, the general form of the 
group-invariance theorem became possible. But we first had to 
overcome at least four other obstacles of heuristically different 
kinds. 

1. We had to accept the value of studying the geometrically trivial
predicate ψ_PARITY. No reference to this predicate is logically necessary
(or even helpful) in proving the group-invariance theorem,
the And/Or theorem, or in explaining the principle of stratifica- 
tion. But we are convinced that its heuristic role was critical. Its 
very geometric triviality enabled us to see the algebraic principles 
in all these situations. The same comment applies to the role of 
the positive normal form: all our results can be proved without it. 
But at a time when we were thoroughly confused about every- 
thing, it allowed us to replace the bewildering variety of the set 
of all possible logical functions by the combinatorial tidiness of 
the masks. 

2. The idea of averaging had been in our minds ever since reading 
Pitts and McCulloch (1947). It was reinforced by a fine proof 
offered by Tom Marill in response to an exposition, at an M.I.T. 
seminar, of our early thoughts on the subject. Marill observed 
that for any φ, if |S(φ)| < |R| then {X | φ(X)} contains the same
number of even- and odd-sized X’s. It follows immediately that
for any set Φ of such φ’s, the sets of vectors


{Φ(X) | |X| even} and {Φ(X) | |X| odd}


must have the same center of gravity. They therefore cannot be 
separated by a hyperplane! 
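Marill's counting observation is easy to check by brute force on a small retina. The sketch below is only a numerical illustration, not a proof; the helper names and the use of a randomly chosen partial predicate are assumptions. The reason the counts agree is that toggling any point of R outside the support S(φ) leaves φ(X) unchanged while flipping the parity of |X|.

```python
from itertools import combinations
import random

def subsets(points):
    """Yield every subset of `points` as a frozenset."""
    points = list(points)
    for size in range(len(points) + 1):
        for combo in combinations(points, size):
            yield frozenset(combo)

def even_odd_counts(phi, retina):
    """Count the even-sized and the odd-sized X within the retina with phi(X) true."""
    even = odd = 0
    for X in subsets(retina):
        if phi(X):
            if len(X) % 2 == 0:
                even += 1
            else:
                odd += 1
    return even, odd

def random_partial_predicate(support, seed=0):
    """An arbitrary predicate that looks only at the points in `support`."""
    rng = random.Random(seed)
    table = {s: rng.random() < 0.5 for s in subsets(sorted(support))}
    support = frozenset(support)
    return lambda X: table[X & support]

retina = range(5)                            # R has 5 points
phi = random_partial_predicate({0, 1, 3})    # |S(phi)| = 3 < |R|
even, odd = even_odd_counts(phi, retina)     # the two counts agree
```

By contrast, a predicate whose support is all of R, such as parity itself, can separate the even-sized from the odd-sized sets completely.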

Although highly suggestive, this proof is still marked by the basic 
weakness of all early thinking about perceptrons, that is, the pre- 
occupation with the image of predicates as sets of points in the 
|Φ|-dimensional Euclidean space. To obtain the group-invariance
theorem, we had to break away from this image. Marill’s proof
averages over a set of |Φ|-points; ours averages over a set of
functionals defined on the subsets of R.





3. An “obstacle” of a very different sort was lack of contact
with classical mathematical methods of proven depth. The proximity
of fundamental properties of polynomials, irreducible
curves, Haar integrals, etc., brought the feeling of “real mathematics”
as opposed to the “purely combinatorial” methods of
earlier papers on perceptrons. This is still sufficiently rare in com- 
puter science to be significant. We are convinced that respect for 
“real mathematics” is a powerful heuristic principle, though it 
must be tempered with practical judgment. 

4. We were reluctant to attach the condition to the group-invariance
theorem that Φ be closed under the group, for this seemed
like a strong restriction. It took us a while to realize that this 
made the group-invariance theorem stronger rather than weaker! 
For the theorem is used mainly to show that various predicates 
cannot be realized by various perceptrons. The closure condition 
then says that such a predicate cannot be realized even by a 
perceptron that has all the appropriate φ’s of each type. Therefore,
it cannot be realized by any smaller perceptron, say, one
with a random selection from such partial predicates. 

13.6.2 ψ_CONNECTED

Our first gropings were driven largely by frustration at not being 
able to prove, even heuristically, the “obvious” fact that this 
predicate is not of finite order. The diversion to ψ_PARITY was partly
motivated by wanting a simpler case for study, partly by the hope
of finding a reduction through a switching argument similar to
that used to prove the diameter-limited theorem. Our first
theorem about the order of ψ_CONNECTED came by a different route,
via the one-in-a-box theorem. Although this solved our original 
problem (in a logical sense) we continued exploring switching 
arguments, following an intuition which later paid dividends in 
ways we cannot pretend to have explicitly anticipated. 

While we were developing the rather complex switching circuits explained 
in Chapter 5 we entirely missed the simpler argument suggested to us 
by David Huffman (§5.5). Although Huffman’s construction gives only a 
weak lower bound to the growth rate of the order of ψ_CONNECTED, it
provides sufficient proof that the predicate is not of finite order. Moreover,
it shows how any predicate on a retina of |R| points can be reduced
to computing ψ_CONNECTED on a retina of about 2^|R| points. Thus this
predicate is shown formally to have a kind of universality for these 
parallel machines akin to that possessed by tree-search in the usual 
serial machine (and also to suffer from similar exponential disasters). 





13.6.3 More Topology 

A by-product of work on connectedness was the pleasing (and 
perhaps puzzling) positive result about the Euler predicate. In 
an early draft of the book we appended to this a mildly false 
proof that no other topological invariants could be in 

£(^ 0 ’ <P m ). 

When we discovered the correct proof of the theorem (§8.4) that 
there were no other diameter-limited topological predicates we 
firmly conjectured that quite different proof methods would be 
necessary to prove this for the order-limited case. So we were
astonished when Mike Paterson, a young British computation 
theorist who had offered to criticize the manuscript, showed how 
the ideas in §5.7 could be used to reduce this to the parity 
switching net, and thus to prove the theorem of §5.9. 

13.6.4 Stratification 

This is the area in which our early intuitions proved most dis- 
astrously wrong. 

Our first formal presentation of the principal results in this book 
was at an American Mathematical Society symposium on Mathe- 
matical Aspects of Computer Science in April 1966. At this time 
we could prove that ψ_CONNECTED was not of finite order and conjectured
that the same must be true of the apparently “global”
predicates of symmetry and twins described in §7.3. 

For the rest of 1966 the matter rested there. We were pleased 
and encouraged by the enthusiastic reception by many colleagues 
at the A.M.S. meeting and no less so by the doleful reception of 
a similar presentation at a Bionics meeting. However, we were 
now involved in establishing at M.I.T. an artificial intelligence 
laboratory largely devoted to real “seeing machines,” and gave no 
attention to perceptrons until we were jolted by attending an 
I.E.E.E. Workshop on Pattern Recognition in Puerto Rico early 
in 1967. 

Appalled at the persistent influence of perceptrons (and similar 
ways of thinking) on practical pattern recognition, we determined
to set out our work as a book. Slightly ironically, the first results 
obtained in our new phase of interest were the pseudo-positive 
applications of stratification. 





Our first contact with the phenomenon was when our student 
John White showed the predicate ψ_HOLLOW SQUARE to have order
three. We had believed it to be order four for reasons the reader 
will see if he tries to realize it with coefficients bounded inde- 
pendently of the size of the square. Perhaps we were so convinced 
of the extreme parallelism of the perceptron that we resisted see- 
ing how certain limited forms of serial computation could be 
encoded into the perceptron algorithm by constructing a size- 
domination hierarchy. 

Whatever the reason, it took us many months to isolate the strati- 
fication principle and so understand why we had failed to prove 
the group-invariance theorem for infinite retinas. It is clear that 
stratification is not a mere “trick” that can reduce the order of a 
predicate, but that unbounded coefficients admit an essentially 
wider range of sequential (conditional) computations, although at 
such a price that this is of only mathematical interest. We are 
convinced that most of the predicates in Chapter 7 have, under 
the bounded condition, no finite orders, and stratification makes 
the difference between finite and infinite. We have not actually 
proved this, however. 

13.6.5 Learning and Memory 

The spirit of the theory in Chapters 11 and 12 is very different 
from that of our geometric theory. To begin with, the research 
objectives seem to face in an opposite direction: learning theorems 
have statements like “If a given predicate is in an L(Φ) then a
certain procedure will find a set of coefficients to represent it.”
Whereas, in the main part of our work the main questions were
directed towards understanding when and why certain predicates
are in certain L(Φ)’s. Also the proper context for the learning
theory seems indeed to be the Φ-dimensional coefficient-space in
which figures and predicates become points and hyperplanes (or 
dually). We have emphasized several times that progress towards 
the geometric theory seemed to come only when we could break 
away from this representation. However, we decided to discuss 
the convergence theorem at first mainly because we felt dissatis- 
fied with the uncritical form of all previous presentations. In 
particular, it was customary to ignore such questions as 

Is the perceptron an efficient form of memory? 





Does the learning time become too long to be practical even when 
separation is possible in principle? 

How do the convergence results compare to those obtained by 
more deliberate design methods? 

What is the perceptron’s relation to other computational devices?

As time went on, the last question became more and more im- 
portant to us. The comparison with the homeostat reinforced our 
interest in the perceptron as a good object for study in the mathe- 
matical theory of computation rather than as a practical machine, 
and we became interested in such questions as: Could we see the 
perceptron convergence theorem as a manifestation of a finite 
state situation? How is it related to hill climbing and other 
search techniques? Does it differ fundamentally from other 
methods of solving linear inequalities? 
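For readers who want the procedure in front of them, the kind of error-correcting rule that the convergence theorem is about can be sketched as follows. This is the familiar textbook form of the rule for an order-1 perceptron with a constant term, offered as an assumption-laden illustration rather than as the exact formulation of Chapter 11; the name `train_perceptron` and the `max_passes` cutoff are hypothetical.

```python
def train_perceptron(examples, n_inputs, max_passes=1000):
    """Error-correction learning for an order-1 perceptron.

    `examples` is a list of (x, label) pairs, x a tuple of 0/1 values and
    label a bool. Returns (weights, passes) on success, where the last
    weight multiplies a constant input of 1, or (None, max_passes) if no
    error-free pass occurred within the limit.
    """
    weights = [0] * (n_inputs + 1)
    for passes in range(1, max_passes + 1):
        mistakes = 0
        for x, label in examples:
            inputs = tuple(x) + (1,)
            fired = sum(w * v for w, v in zip(weights, inputs)) > 0
            if fired != label:
                # Add the input vector on a missed positive,
                # subtract it on a false alarm.
                sign = 1 if label else -1
                weights = [w + sign * v for w, v in zip(weights, inputs)]
                mistakes += 1
        if mistakes == 0:
            return weights, passes
    return None, max_passes

OR = [((0, 0), False), ((0, 1), True), ((1, 0), True), ((1, 1), True)]
XOR = [((0, 0), False), ((0, 1), True), ((1, 0), True), ((1, 1), False)]

or_weights, or_passes = train_perceptron(OR, 2)
xor_weights, _ = train_perceptron(XOR, 2, max_passes=50)
```

The OR predicate is linearly separable, so the loop halts with a correct set of coefficients. The two-point parity predicate (XOR) is not of order 1, so no pass is ever error-free and the sketch gives up at `max_passes`; the rule itself says nothing about how long to wait, which is one reason the questions above matter.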

We rather complacently thought that the first question would 
have an easy answer until our student, Terry Beyer, drew our 
attention to some difficulties and soundly conjectured that the 
way out would be to prove something like what eventually became
Theorem 11.6. A concerted effort to settle the problem with
the help of Dona Strauss led to an interesting but unpublishable 
proof. Soon after this we heard that Bradley Efron had proved 
a similar theorem and found that by borrowing an idea from him 
we could generate the demonstration given in §11.6. Efron, who 
did not publish his proof, credits Nils Nilsson with the conjecture 
that led him to the theorem. 

We now see the perceptron learning theorem as a simple example 
of a larger problem about memory (or information storage and 
retrieval), much as the nonlearning perceptron is a paradigm in 
the theory of parallel computation. Chapter 12 can be regarded 
as a manifesto on the importance of this problem. 

13.7 Computational Geometry 

We like to think that the perceptron illustrates the possibility 
of a more organic interaction between traditional mathematical 
topics and ideas of computation. When we first talked about 
ψ_CONNECTED we thought we were studying an isolated fact about a
type of computer device. By the time we had conjectured and 
started to prove that the Eulerian predicates were the only topo- 
logical ones of finite order, we felt we were studying geometry. 





This might represent a bad tendency of people trained in classical 
mathematics to drag the new subject of computation back into 
their familiar territory. Or it could prefigure a future of com- 
putational thought not just as a new and separate autonomous 
science but as a way of thinking that will permeate all the other 
sciences. 

The truth must lie between these extremes. In any case, we are 
excited to see around us at M.I.T. a steady growth of research 
whose success is a confirmation of the value of the concept of 
“computational geometry” as well as of the talent of our col- 
leagues and students. 

Manuel Blum and Carl Hewitt obtained the first new result 
by studying the geometric ability of finite-state machines. More 
recently, Blum, Bill Henneman, and Harriet Fell have found in- 
teresting properties of the relations induced on Euclidean figures 
by the imposition of a discrete grid. Terry Beyer has discovered 
very surprising algorithms for geometric computation in iterative 
arrays of automata. Needless to say, these people have con- 
tributed to the work described in this book as much by their 
indirect influence on the intellectual atmosphere around us as by 
many pieces of advice, comment, and criticism of which only a 
small proportion is reflected in direct acknowledgments in the 
text. Many points of mathematical technique and exposition were 
improved by suggestions from W. Armstrong, R. Beard, L. 
Lyons, M. Paterson, and G. Sussman. 

13.8 Other People 

In addition to those already mentioned, we owe much to 

W. W. Bledsoe, 

Dona Strauss, 

O. G. Selfridge, 

R. J. Solomonoff,

R. M. Fano. 

13.9 Other Places 

For political and heuristic reasons, we mention that most of the 
new ideas came in new environments: beaches, swamps, and 
mountains. 





13.10 Institutions 

Even if we had not been supported by the Advanced Research 
Projects Agency we would have liked to express the debt owed 
by computer science to its Information Sciences branch and to the 
imaginative band of men who built it: 

J. R. Licklider, 

I. E. Sutherland, 

R. W. Taylor, 

L. G. Roberts. 

In fact, ARPA has supported most of our work, through M.I.T.’s 
Project MAC and the Office of Naval Research. 



Epilogue: The New Connectionism 


When perceptron-like machines came on the scene, we found that 
in order to understand their capabilities we needed some new 
ideas. It was not enough simply to examine the machines them- 
selves or the procedures used to make them learn. Instead, we had 
to find new ways to understand the problems they would be asked 
to solve. This is why our book turned out to be concerned less with 
perceptrons per se than with concepts that could help us see the 
relation between patterns and the types of parallel-machine ar- 
chitectures that might or might not be able to recognize them. 

Why was it so important to develop theories about parallel ma- 
chines? One reason was that the emergence of serial computers 
quickly led to a very respectable body of useful ideas about 
algorithms and algorithmic languages, many of them based on a 
half-century’s previous theories about logic and effective computa- 
bility. But similarly powerful ideas about parallel computation did 
not develop nearly so rapidly — partly because massively parallel 
hardware did not become available until much later and partly 
because much less knowledge that might be relevant had been ac- 
cumulated in the mathematical past. Today, however, it is feasible 
either to simulate or to actually assemble huge and complex ar- 
rangements of interacting elements. Consequently, theories about 
parallel computation have now become of immediate and intense 
concern to workers in physics, engineering, management, and 
many other disciplines — and especially to workers involved with 
brain science, psychology, and artificial intelligence. 

Perhaps this is why the past few years have seen new and heated 
discussions of network machines as part of an intellectually aggres- 
sive movement to establish a paradigm for artificial intelligence and 
cognitive modeling. Indeed, this growth of activity and interest has 
been so swift that people talk about a “connectionist revolution.” 
The purpose of this epilogue, added in 1988, is to help present-day 
students to use the ideas presented in Perceptrons to put the new 
results into perspective and to formulate more clearly the research 
questions suggested by them. To do this succinctly, we adopt the 
strategy of focusing on one particular example of modern connec- 
tionist writing. Recently, David Rumelhart, James McClelland, 
and fourteen collaborators published a two-volume work that has 
become something of a connectionist manifesto: Parallel Distributed
Processing (MIT Press, 1986). We shall take this work (henceforth
referred to as PDP) as our connectionist text. What we say
about this particular text will not, of course, apply literally to other 
writings on this subject, but thoughtful readers will seize the gen- 
eral point through the particular case. In most of this epilogue we 
shall discuss the examples in PDP from inside the connectionist 
perspective, in order to flag certain problems that we do not expect 
to be solvable within the framework of any single, homogeneous 
machine. At the end, however, we shall consider the same prob- 
lems from the perspective of the overview we call “society of 
mind,” a conceptual framework that makes it much more feasible 
to exploit collections of specialized accomplishments. 

PDP describes Perceptrons as pessimistic about the prospects for 
connectionist machinery: 

“. . . even though multilayer linear threshold networks are potentially
much more powerful ... it was the limitations on what perceptrons
could possibly learn that led to Minsky and Papert’s
(1969) pessimistic evaluation of the perceptron. Unfortunately, 
that evaluation has incorrectly tainted more interesting and power- 
ful networks of linear threshold and other nonlinear units. As we 
shall see, the limitations of the one-step perceptrons in no way 
apply to the more complex networks.” (vol. 1, p. 65) 

We scarcely recognize ourselves in this description, and we recom- 
mend rereading the remarks in section 0.3 about romanticism and 
rigor. We reiterate our belief that the romantic claims have been 
less wrong than the pompous criticisms. But we also reiterate that 
the discipline can grow only when it makes a parallel effort to 
critically evaluate its apparent accomplishments. Our own work in 
Perceptrons is based on the interaction between an enthusiastic 
pursuit of models of new phenomena and a rigorous search for 
ways to understand the limitations of these models. 

In any case, such citations have given our book the reputation of 
being mainly concerned with what perceptrons cannot do, and of 
having concluded with a qualitative evaluation that the subject was 
not important. Certainly, some chapters prove that various impor- 
tant predicates have perceptron coefficients that grow unmanage- 
ably large. But many chapters show that other predicates can be 
surprisingly tractable. It is no more apt to describe our mathematical
theorems as pessimistic than it would be to say the same about
deducing the conservation of momentum from the laws of mechanics.
Theorems are theorems, and the history of science amply demonstrates
how discovering limiting principles can lead to deeper
understanding. But this happens only when those principles are 
taken seriously, so we exhort contemporary connectionist re- 
searchers to consider our results seriously as sources of research 
questions instead of maintaining that they “in no way apply.” 

What Perceptrons Can’t Do 

To put our results into perspective, let us recall the situation in the 
early 1960s: Many people were impressed by the fact that initially
unstructured networks composed of very simple devices could be 
made to perform many interesting tasks — by processes that could 
be seen as remarkably like some forms of learning. 

A different fact seemed to have impressed only a few people: While 
those networks did well on certain tasks and failed on certain 
other tasks, there was no theory to explain what made the differ- 
ence — particularly when they seemed to work well on small (“toy”) 
problems but broke down with larger problems of the same kind. 

Our goal was to develop analytic tools to give us better ideas about 
what made the difference. But finding a comprehensive theory of 
parallel computation seemed infeasible, because the subject was 
simply too general. What we had to do was sharpen our ideas by 
working with some subclass of parallel machines that would be 
sufficiently powerful to perform significant computations, that 
would also share at least some of the features that made such 
networks attractive to those who sought a deeper understanding of 
the brain, and that would also be mathematically simple enough to 
permit theoretical analysis. This is why we used the abstract definition
of perceptron given in this book. The perceptron seemed powerful
enough in function, suggestive enough in architecture, and
simple enough in its mathematical definition, yet understanding 
the range and character of its capabilities presented challenging 
puzzles. 

Our prime example of such a puzzle was the recognition of 
connectedness. It took us many months of work to capture in a 
formal proof our strong intuition that perceptrons were unable to
represent that predicate. Perhaps the most instructive aspect of
that whole process was that we were guided by a flawed intuition to 
the proof that perceptrons cannot recognize the connectivity in any 
general or practical sense. We had assumed that perceptrons could 
not even detect the connectivity of hole-free blobs — because, as 
we supposed, no local forms of evidence like those in figure 5.7 
could correlate with the correct decision. Yet, as we saw in subsec- 
tion 5.8.1, if a figure is known to have no holes, then a low-order 
perceptron can decide on its connectivity; this we had not initially 
believed to be possible. It is hard to imagine better evidence to 
show how artificial it is to separate “negative” from “positive” 
results in this kind of investigation. To explain how this experience 
affected us, we must abstract what we learned from it. 

First we learned to reformulate questions like “Can perceptrons 
perform a certain task?” Strictly speaking, it is misleading to say 
that perceptrons cannot recognize connectedness, since for any 
particular size of retina we can make a perceptron that will recognize
any predicate by providing it with enough φ’s of sufficiently
high order. What we did show was that the general predicate re- 
quires perceptrons of unbounded order. More generally, we 
learned to replace globally qualitative questions about what per- 
ceptrons cannot do with questions in the spirit of what is now 
called computational complexity. Many of our results are of the
form M = f(R), where R is a measure of the size of the problem and
M is the magnitude of some parameter of a perceptron (such as the
order of its predicates, how many of them might be required, the 
information content of the coefficients, or the number of cycles 
needed for learning to converge). The study of such relationships 
gave us a better sense of what is likely to go wrong when one tries 
to enlarge the scale of a perceptron-like computation. In serial 
computing it was already well known that certain algorithms de- 
pending on search processes would require numbers of steps of 
computation that increased exponentially with the size of the prob- 
lem. Much less was known about such matters in the case of paral- 
lel machines. 

The second lesson was that in order to understand what percep- 
trons can do we would have to develop some theories of “problem 
domains” and not simply a “theory of perceptrons.” In previous
work on networks, from McCulloch and Pitts to Rosenblatt, even 
the best theorists had tried to formulate general-purpose theories 
about the kinds of networks they were interested in. Rosenblatt’s 
convergence theorem is an example of how such investigations can 
lead to powerful results. But something qualitatively different was 
needed to explain why perceptrons could recognize the connect- 
edness of hole-free figures yet be unable to recognize con- 
nectedness in general. For this we needed a bridge between a the- 
ory about the computing device and a theory about the content of 
the computation. The reason why our group-invariance theorem 
was so useful here was that it had one foot on the geometric side 
and one on the computational side. 

Our study of the perceptron was an attempt to understand general 
principles through the study of a special case. Even today, we still 
know very little, in general, about how the costs of parallel compu- 
tation are affected by increases in the scale of problems. Only the 
cases we understand can serve as bases for conjectures about what 
will happen in other situations. Thus, until there is evidence to the 
contrary, we are inclined to project the significance of our results 
to other networks related to perceptrons. In the past few years, 
many experiments have demonstrated that various new types of 
learning machines, composed of multiple layers of perceptron-like 
elements, can be made to solve many kinds of small-scale prob- 
lems. Some of those experimenters believe that these perfor- 
mances can be economically extended to larger problems without 
encountering the limitations we have shown to apply to single- 
layer perceptrons. Shortly, we shall take a closer look at some of 
those results and see that much of what we learned about simple 
perceptrons will still remain quite pertinent. It certainly is true that 
most of the theorems in this book are explicitly about machines 
with a single layer of adjustable connection weights. But this does 
not imply (as many modern connectionists assume) that our con- 
clusions don’t apply to multilayered machines. To be sure, those 
proofs no longer apply unchanged, because their antecedent condi- 
tions have changed. But the phenomena they describe will often 
still persist. One must examine them, case by case. For example, 
all our conclusions about order-limited predicates (see section 0.7) 
continue to apply to networks with multiple layers, because the 
order of any unit in a given layer is bounded by the product of the
orders of the units in earlier layers. Since many of our arguments
about order constrain the representations of group-invariant
predicates, we suspect that many of those conclusions, too, will apply to
multilayer nets. For example, multilayer networks will be no more
able to recognize connectedness than are perceptrons. (This is not
to say that multilayer networks do not have advantages. For
example, the product rule can yield logarithmic reductions in the
orders and numbers of units required to compute certain high-order
predicates. Furthermore, units that are arranged in loops can
be of effectively unbounded order; hence, some such networks
will be able to recognize connectedness by using internal serial
processing.)

[Figure 1: Symmetry using order-2 disjunction.]

Thus, in some cases our conclusions will remain provably true and 
in some cases they will be clearly false. In the middle there are 
many results that we still think may hold, but we do not know 
any formal proofs. In the next section we shall show how some 
of the experiments reported in PDP lend credence to some such 
conjectures. 

Recognizing Symmetry 

In this section we contrast two different networks, both of which
recognize symmetrical patterns defined on a six-point linear retina.
To be precise, we would like to recognize the predicate X is
symmetric about the midpoint of R. Figure 1 shows a simple way to
represent this as a perceptron that uses R φ units, each of order
2. Each one of them will locally detect a deviation from symmetry
at two particular retinal points. Figure 2 shows the results of an
experiment from PDP. It depicts a network that represents
ψ_SYMMETRY in quite a different way. Amazingly, this network uses
only two φ functions — albeit ones of order R.

[Figure 2: Symmetry using order-R stratification. Actual coefficients from PDP experiment.]
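
The order-2 representation can be made concrete with a small Python sketch. The figure itself did not survive extraction, so the wiring below is only one plausible reading of the description, not a transcription of the figure. The names `deviation_units` and `symmetric` are hypothetical. Each unit looks at one retinal point and its mirror image, and the output unit fires only when no unit reports a deviation.

```python
from itertools import product

def deviation_units(x):
    """One order-2 unit per retinal point i (assumed wiring):
    it fires when x[i] is on but its mirror image x[n-1-i] is off."""
    n = len(x)
    return [int(x[i] == 1 and x[n - 1 - i] == 0) for i in range(n)]

def symmetric(x):
    """Linear threshold unit: every deviation unit has weight -1,
    and the output fires when the weighted sum exceeds -1/2."""
    return int(sum(-u for u in deviation_units(x)) > -0.5)

# Check all 64 patterns on a six-point retina against the definition.
for x in product((0, 1), repeat=6):
    assert symmetric(x) == int(x == x[::-1])
```

On a retina of R points this sketch uses R units, and every coefficient has magnitude 1 regardless of R.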

The weights displayed in figure 2 were produced by a learning 
procedure that we shall describe shortly. For the moment, we want 
to focus not on the learning problem but on the character of the 
coefficients. We share the sense of excitement the PDP experi- 
menters must have experienced as their machine converged to this 
strange solution, in which this predicate seems to be portrayed as 
having a more holistic character than would be suggested by its 
conjunctively local representation. However, one must ask certain 
questions before celebrating this as a significant discovery. In PDP 
it is recognized that the lower-level coefficients appear to be grow- 
ing exponentially, yet no alarm is expressed about this. In fact, 
anyone who reads section 7.3 should recognize such a network as 
employing precisely the type of computational structure that we 
called stratification. Also, in the case of network 2, the learning 
procedure required 1,208 cycles through each of the 64 possible 
examples — a total of 77,312 trials (enough to make us wonder if the 
time for this procedure to determine suitable coefficients increases 
exponentially with the size of the retina). PDP does not address 
this question. What happens when the retina has 100 elements? If 
such a network required on the order of 2^200 trials to learn, most
observers would lose interest. 
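
To see why the coefficients of a two-unit solution must grow, consider the following Python sketch. The weights are an assumption chosen for illustration (powers of two mirrored with opposite sign); they are not the actual PDP coefficients, which were lost with the figure. The names `stratified_weights` and `symmetry_two_units` are hypothetical.

```python
from itertools import product

def stratified_weights(r):
    """Hypothetical antisymmetric weights 1, 2, 4, ... mirrored with
    opposite sign about the midpoint (r must be even)."""
    half = [2 ** k for k in range(r // 2)]
    return half + [-w for w in reversed(half)]

def symmetry_two_units(x):
    """Two order-R threshold units share one weighted sum; the output
    fires only when both units stay silent (the sum is exactly zero)."""
    w = stratified_weights(len(x))
    s = sum(wi * xi for wi, xi in zip(w, x))
    unit_a = int(s > 0)
    unit_b = int(-s > 0)
    return int(-unit_a - unit_b > -0.5)

for x in product((0, 1), repeat=6):
    assert symmetry_two_units(x) == int(x == x[::-1])

largest_for_100 = max(stratified_weights(100))  # equals 2 ** 49
trials = 1208 * 2 ** 6                          # the 77,312 trials quoted above
```

Under this assumed weighting the largest coefficient doubles with every two points added to the retina, which is the kind of growth the text attributes to stratification.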

This observation shows most starkly how we and the authors of
PDP differ in interpreting the implications of our theory. Our
“pessimistic evaluation of the perceptron” was the assertion that,
although certain problems can easily be solved by perceptrons on
small scales, the computational costs become prohibitive when the
problem is scaled up. The authors of PDP seem not to recognize
that the coefficients of this symmetry machine confirm that thesis,
and celebrate this performance on a toy problem as a success
rather than asking whether it could become a profoundly “bad”
form of behavior when scaled up to problems of larger size.

[Figure 3: Parity using Gamba masks.]

Both of these networks are in the class of what we called Gamba 
perceptrons in section 13.1 — that is, ordinary perceptrons whose φ
functions are themselves perceptrons of order 1. Accordingly, we 
are uncomfortable about the remark in PDP that “multilayer linear
threshold networks are potentially much more powerful than sin- 
gle-layer perceptrons.” Of course they are, in various ways — and 
chapter 8 of PDP describes several studies of multilayer
perceptron-like devices. However, most of them — like figure 2 above —
still belong to the class of networks discussed in Perceptrons. 

Also in chapter 8 of PDP, similar methods are applied to the
problem of recognizing parity — and the very construction described in
our section 13.1, through which a Gamba perceptron can recognize 
parity, is rediscovered. Figure 3 here shows the results. To learn 
these coefficients, the procedure described in PDP required 2,825
cycles through the 16 possible input patterns, thus consuming 
45,200 trials for the network to learn to compute the parity predi- 
cate for only four inputs. Is this a good result or a bad result? We 
cannot tell without more knowledge about why the procedure re- 
quires so many trials. Until one has some theory of that, there is no
way to assess the significance of any such experimental result; all 
one can say is that 45,200 = 45,200. In section 10.1 we saw that if a 
perceptron’s φ functions include only masks, the parity predicate
requires doubly exponential coefficients. If we were sure that that 
was happening, this would suggest to us that we should represent 
45,200 (approximately) as 2^(2^4) rather than, say, as 2^16. However,
here we suspect that this would be wrong, because the input units 
aren’t masks but predicates — apparently provided from the start — 
that already know how to “count.” These make the problem much 
easier. In any case, the lesson of Perceptrons is that one cannot 
interpret the meaning of such an experimental report without first 
making further probes. 
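
The remark about units that already know how to “count” can be illustrated with a short Python sketch of a Gamba-style parity network. This is an assumed construction in the spirit of the one the text refers to, not the learned coefficients of figure 3, and the name `parity_gamba` is hypothetical. Each hidden unit is an order-1 threshold unit over all inputs, and the output unit combines them with alternating signs.

```python
from itertools import product

def parity_gamba(x):
    """Parity from counting units (assumed construction)."""
    n = len(x)
    total = sum(x)  # every hidden unit uses weight 1 on every input
    # Hidden unit k fires when at least k inputs are on.
    hidden = [int(total >= k) for k in range(1, n + 1)]
    # Output unit: weights +1, -1, +1, ... and threshold 1/2.
    out = sum(h if k % 2 == 0 else -h for k, h in enumerate(hidden))
    return int(out > 0.5)

for x in product((0, 1), repeat=4):
    assert parity_gamba(x) == sum(x) % 2

trials = 2825 * 16  # cycles times patterns, the 45,200 quoted above
```

With such units supplied, every coefficient has magnitude 1, which is why the number of trials cannot be read as evidence of doubly exponential growth without further analysis.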

Learning 

We haven’t yet said how those networks learned. The authors of 
PDP describe a learning procedure called the “Generalized Delta 
Rule” — we’ll call it GD — as a new breakthrough in connectionist 
research. To explain its importance, they depict as follows the theo- 
retical situation they inherited: 

“A further argument advanced by Minsky and Papert against per- 
ceptron-like models with hidden units is that there was no indica- 
tion how such multilayer networks were to be trained. One of the 
appealing features of the one-layer perceptron is the existence of a 
powerful learning procedure, the perceptron convergence proce- 
dure of Rosenblatt. In Minsky and Papert’s day, there was no such 
powerful learning procedure for the more complex multilayer sys- 
tems. This is no longer true. . . . The GD procedure provides a 
direct generalization of the perceptron learning procedure which 
can be applied to arbitrary networks with multiple layers and feed- 
back among layers. This procedure can, in principle, learn arbi- 
trary functions including, of course, parity and connectedness.” 
(vol. 1, p. 113) 

In Minsky and Papert’s day, indeed! In this section we shall ex- 
plain why, although the GD learning procedure embodies some 
useful ideas, it does not justify such sweeping claims. But in order 
to explain why, and to see how the approach in the current wave of 
connectionism differs from that in Perceptrons, we must first ex- 
amine with some care the relationship between two branches of 
perceptron theory which could be called “theory of learning” and 
“theory of representation.” To begin with, one might paraphrase
the above quotation as saying that, until recently, connectionism 
had been paralyzed by the following dilemma: 

Perceptrons could learn anything that they could represent, but 
they were too limited in what they could represent. 

Multilayered networks were less limited in what they could repre- 
sent, but they had no reliable learning procedure. 

According to the classical theory of perceptrons, those limitations 
on representability depend on such issues as whether a given predicate
P can be represented as a perceptron defined by a given set Φ
on a given retina, whether P is of finite order, whether P can be 
realized with coefficients of bounded size, whether properties of 
several representable predicates are inherited by combinations of 
those predicates, and so forth. All the results in the first half of our 
book are involved with these sorts of representational issues. 
Now, when one speaks about “powerful learning procedures,” the 
situation is complicated by the fact that, given enough input units 
of sufficiently high order, even simple perceptrons can represent — 
and therefore learn — arbitrary functions. Consequently, it makes 
no sense to speak about “power” in absolute terms. Such state- 
ments must refer to relative measures of sizes and scales. 

As for learning, the dependability of Rosenblatt’s Perceptron Con- 
vergence theorem of section 11.1 — let’s call it PC for short — is 
very impressive: If it is possible at all to represent a predicate P as 
a linear threshold function of a given set of predicates Φ, then the
PC procedure will eventually discover some particular set of 
coefficients that actually represents P. However, this is not, in 
itself, a sufficient reason to consider PC interesting and important, 
because that theorem says nothing about the crucial issue of 
efficiency. PC is not interesting merely because it provides a sys- 
tematic way to find suitable coefficients. One could always take 
recourse, instead, to simple, brute-force search — because, given 
that some solution exists, one could simply search through all pos- 
sible integer coefficient vectors, in order of increasing magnitude, 
until no further “errors” occurred. But no one would consider 
such an exhaustive process to be an interesting foundation for a 
learning theory. 
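
The brute-force alternative mentioned above might be sketched in Python as follows. The function names, the `max_magnitude` cutoff, and the use of raw inputs as the φ values are all assumptions made for illustration; the text only says that one could enumerate integer coefficient vectors in order of increasing magnitude.

```python
from itertools import product

def represents(weights, theta, examples):
    """True if the threshold function agrees with every labeled example."""
    return all(
        (sum(w * x for w, x in zip(weights, xs)) > theta) == bool(label)
        for xs, label in examples
    )

def brute_force_search(examples, n, max_magnitude=5):
    """Try integer (weights, threshold) vectors in order of increasing
    largest magnitude; return the first one with no errors, else None."""
    for m in range(max_magnitude + 1):
        for cand in product(range(-m, m + 1), repeat=n + 1):
            if max(abs(c) for c in cand) != m:
                continue  # already tried at a smaller magnitude
            *weights, theta = cand
            if represents(weights, theta, examples):
                return weights, theta
    return None

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
found = brute_force_search(AND, 2)
not_found = brute_force_search(XOR, 2, max_magnitude=3)
```

The number of candidates at magnitude m grows as (2m + 1) raised to the number of coefficients, so the search is dependable but quickly becomes impractical.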



What, then, makes PC seem significant? That it discovers those 
coefficients in ways that are intriguing in several other important 
respects. The PC procedure seems to satisfy many of the intuitive 
requirements of those who are concerned with modeling what 
really happens in a biological nervous system. It also appeals to 
both our engineering aesthetic and our psychological aesthetic by 
serving simultaneously as both a form of guidance by error correc- 
tion and a form of hill-climbing. In terms of computational effi- 
ciency, PC seems much more efficient than brute-force procedures 
(although we have no rigorous and general theory of the condi- 
tions under which that will be true). Finally, PC is so simple mathe- 
matically as to make one wish to believe that it reflects something real. 
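
For comparison with exhaustive search, here is a minimal Python sketch of an error-correction procedure of the PC kind. It is a common textbook form and is offered as an assumption, not as the exact procedure of section 11.1. The name `perceptron_convergence`, the `max_passes` cutoff, and the trick of folding the threshold into an extra constant input are illustrative choices.

```python
def perceptron_convergence(examples, n, max_passes=1000):
    """Adjust weights only when an error occurs. The last weight acts
    on a constant input of 1 and so plays the role of the threshold."""
    w = [0] * (n + 1)
    for _ in range(max_passes):
        errors = 0
        for xs, label in examples:
            v = list(xs) + [1]
            fired = sum(wi * vi for wi, vi in zip(w, v)) > 0
            if fired and not label:      # false alarm: subtract the input
                w = [wi - vi for wi, vi in zip(w, v)]
                errors += 1
            elif label and not fired:    # miss: add the input
                w = [wi + vi for wi, vi in zip(w, v)]
                errors += 1
        if errors == 0:
            return w                     # a full pass with no errors
    return None                          # gave up; no claim about existence

def classify(w, xs):
    return int(sum(wi * vi for wi, vi in zip(w, list(xs) + [1])) > 0)

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
w_and = perceptron_convergence(AND, 2)
w_xor = perceptron_convergence(XOR, 2, max_passes=50)
```

Each update uses only the current stimulus and the current error, which is the sense in which the procedure is both error correction and a local climbing step.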

Hill-Climbing and the Generalized Delta Procedure 

Suppose we want to find the maximum value of a given function 
F(x,y,z, . . .) of n variables. The extreme brute-force solution is to 
calculate the function for all sets of values for the variables and 
then select the point for which F had the largest value. The ap- 
proach we called hill-climbing in section 11.3 is a local procedure 
designed to attempt to find that global maximum. To make this 
subject more concrete, it is useful to think of the two-dimensional 
case in which the x-y plane is the ground and z = F(x,y) is the 
elevation of the point (x,y,z) on the surface of a real physical hill. 
Now, imagine standing on the hill in a fog so dense that only the 
immediate vicinity is visible. Then the only resort is to use some 
diameter-limited local process. The best-known such method is
“steepest ascent,” discussed in section 11.6:
First determine the slope of the surface in various directions from 
the point where you are standing, then choose the direction that 
most rapidly increases your altitude and take a step of a certain size 
in that direction. The hope is that, by thus climbing the slope, you 
will eventually reach the highest point. 
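
The procedure just described can be sketched in Python. The landscape `F` below is a hypothetical hill with one low peak and one high peak, chosen only for illustration, and the step taken is proportional to the estimated slope, which is one common variant of taking “a step of a certain size.”

```python
import math

def F(x, y):
    """Hypothetical landscape: a low peak near (-2, 0), a higher one near (2, 0)."""
    return math.exp(-((x + 2) ** 2 + y ** 2)) + 2 * math.exp(-((x - 2) ** 2 + y ** 2))

def slope(f, x, y, h=1e-5):
    """Estimate the local slope from nearby points only (the 'fog')."""
    return ((f(x + h, y) - f(x - h, y)) / (2 * h),
            (f(x, y + h) - f(x, y - h)) / (2 * h))

def steepest_ascent(f, x, y, step=0.1, iterations=2000):
    for _ in range(iterations):
        gx, gy = slope(f, x, y)
        x, y = x + step * gx, y + step * gy
    return x, y

low = steepest_ascent(F, -3.0, 0.5)   # starts on the flank of the low peak
high = steepest_ascent(F, 3.0, 0.5)   # starts on the flank of the high peak
```

Both runs stop where the slope vanishes, but the first settles near (-2, 0), where the altitude is about half that of the summit near (2, 0). Nothing in the local information distinguishes the two outcomes.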

It is both well known and obvious that hill-climbing does not al- 
ways work. The simplest way to fail is to get stuck on a local 
maximum — an isolated peak whose altitude is relatively in- 
significant. There simply is no local way for a hill-climbing proce- 
dure to be sure that it has reached a global maximum rather than 
some local feature of topography (such as a peak, a ridge, or a 
plain) on which it may get trapped. We showed in section 11.6 that 
PC is equivalent (in a peculiar sense) to a hill-climbing procedure 
that works its way to the top of a hill whose geometry can actually 



[258] Epilogue 


be proved not to have any such troublesome local features — 
provided that there actually exists some perceptron-weight vector 
solution A* to the problem. Thus, one could argue that perceptrons
“work” on those problems not because of any particular virtue of
the perceptrons or of their hill-climbing procedures but because the 
hills for those soluble problems have clean topographies. What are 
the prospects of finding a learning procedure that works equally 
well on all problems, and not merely on those that have linearly 
separable decision functions? The authors of PDP maintain that 
they have indeed discovered one: 

“Although our learning results do not guarantee that we can find a 
solution for all solvable problems, our analyses and results have 
shown that, as a practical matter, the error propagation scheme 
leads to solutions in virtually every case. In short, we believe that 
we have answered Minsky and Papert’s challenge and have found a 
learning result sufficiently powerful to demonstrate that their pes- 
simism about learning in multilayer machines was misplaced.” 
(vol. 1, p. 361) 

But the experiments in PDP, though interesting and ingenious, do
not actually demonstrate any such thing. In fact, the “powerful 
new learning result” is nothing other than a straightforward hill- 
climbing algorithm, with all the problems that entails. To see how 
GD works, assume we are given a network of units interconnected 
by weighted, unidirectional links. Certain of these units are con- 
nected to input terminals, and certain others are regarded as output 
units. We want to teach this network to respond to each (vector)
input pattern X_p with a specified output vector Y_p. How can we find
a set of weights W = {w_ij} that will accomplish this? We could try to
do it by hill-climbing on the space of Ws, provided that we could
define a suitable measure of relative altitude or “success.” One
problem is that there cannot be any standard, universal way to
measure errors, because each type of error has different costs in
different situations. But let us set that issue aside and do what
scientists often do when they can’t think of anything better: sum
the squares of the differences. So, if Y(W,X) is the network’s output
vector for internal weights W and inputs X, define the altitude
function E(W) to be this sum:

$$E(W) = -\sum_{\text{all input patterns } p} \left[\, Y_p - Y(W, X_p) \,\right]^2$$



In other words, we compute our measure of success by presenting 
successively each stimulus X p to the network. Then we compute 
the (vector) difference between the actual output and the desired 
output. Finally, we add up the squares of the magnitudes of those 
differences. (The minus sign is simply for thinking of climbing up 
instead of down.) The error function E will then have a maximum 
possible value of zero, which will be achieved if and only if the 
machine performs perfectly. Otherwise there will be at least one 
error and E(W) will be negative. Then all we have to do is climb the hill
E(W) defined over the (high-dimensional) space of weight vectors
W. If our path reaches a W for which E(W) is zero, our problem
will be solved and we will be able to say that our machine has 
“learned from its experience.” 
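
The altitude function can be written out directly. In this Python sketch the names `altitude` and `threshold_unit` are hypothetical, and the network is a single linear threshold unit chosen only to keep the example small; any function from weights and inputs to an output vector could be substituted.

```python
def altitude(network, W, patterns):
    """E(W): minus the sum, over all patterns, of the squared magnitude
    of the difference between desired and actual output vectors."""
    total = 0.0
    for X_p, Y_p in patterns:
        Y = network(W, X_p)
        total += sum((t - y) ** 2 for t, y in zip(Y_p, Y))
    return -total

def threshold_unit(W, X):
    """Illustrative network: one unit; the last entry of W is its threshold."""
    *weights, theta = W
    return [int(sum(w * x for w, x in zip(weights, X)) > theta)]

AND_PATTERNS = [((0, 0), [0]), ((0, 1), [0]), ((1, 0), [0]), ((1, 1), [1])]
perfect = altitude(threshold_unit, [1, 1, 1], AND_PATTERNS)    # no errors
imperfect = altitude(threshold_unit, [0, 0, 0], AND_PATTERNS)  # one error
```

The first weight vector reaches the summit at zero; the second makes one error and therefore sits one unit below it.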

We’ll use a process that climbs this hill by the method of steepest
ascent. We can do this by estimating, at every step, the partial
derivatives ∂E/∂w_ij of the total error with respect to each component
of the weight vector. This tells us the direction of the gradient
vector dE/dW, and we then proceed to move a certain distance in
that direction. This is the mathematical character of the General- 
ized Delta procedure, and it differs in no significant way from older 
forms of diameter-limited gradient followers. 

Before such a procedure can be employed, there is an obstacle to 
overcome. One cannot directly apply the method of gradient ascent 
to networks that contain threshold units. This is because the 
derivative of a step-function is zero, whenever it exists, and hence 
no gradient is defined. To get around this, PDP applies a smoothing 
function to make those threshold functions differentiable. The 
trick is to replace the threshold function for each unit with a mono- 
tonic and differentiable function of the sum of that unit’s inputs. 
This permits the output of each unit to encode information about 
the sum of its inputs while still retaining an approximation to the 
perceptron’s decision-making ability. Then gradient ascent be- 
comes more feasible. However, we suspect that this smoothing 
trick may entail a large (and avoidable) cost when the predicate to 
be learned is actually a composition of linear threshold functions. 
There ought to be a more efficient alternative based on how much 
each weight must be changed, for each stimulus, to make the local 
input sum cross the threshold. 
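
The obstacle and the smoothing trick can be seen numerically. The text requires only a monotonic, differentiable replacement for the threshold; the logistic function used in this Python sketch is one common choice and is an assumption here, as are the function names.

```python
import math

def step(s):
    """Threshold unit: output jumps from 0 to 1 as the input sum crosses 0."""
    return 1.0 if s > 0 else 0.0

def logistic(s):
    """A smooth, monotonic stand-in for the step (assumed choice)."""
    return 1.0 / (1.0 + math.exp(-s))

def numerical_derivative(f, s, h=1e-6):
    return (f(s + h) - f(s - h)) / (2 * h)

flat = numerical_derivative(step, 0.7)        # zero away from the jump
usable = numerical_derivative(logistic, 0.7)  # nonzero: a slope to climb
closed_form = logistic(0.7) * (1 - logistic(0.7))
```

Away from the jump the step function offers no slope at all, whereas the smoothed unit reports how strongly its output would respond to a small change in its input sum.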



In what sense is the particular hill-climbing procedure GD more 
powerful than the perceptron’s PC? Certainly GD can be applied to 
more networks than PC can, because PC can operate only on the 
connections between one layer of φ units and a single output unit.
GD, however, can modify the weights in an arbitrary multilayered
network, including nets containing loops. Thus, in contrast to the
perceptron (which is equipped with some fixed set of φs that can
never be changed), GD can be regarded as able to change the
weights inside the φs. Thus GD promises, in effect, to be able to
discover useful new φ functions — and many of the experiments
reported in PDP demonstrate that this often works. 

A natural way to estimate the gradient of E(W) is to estimate ∂E/∂w_ij
by running through the entire set of inputs for each weight. How- 
ever, for large networks and large problems that could be a hor- 
rendous computation. Fortunately, in a highly connected network, 
all those many components of the gradient are not independent of 
one another, but are constrained by the algebraic “chain rule” for 
the derivatives of composite functions. One can exploit those con- 
straints to reduce the amount of computation by applying the chain- 
rule formula, recursively, to the mathematical description of the 
network. This recursive computation is called “back-propagation” 
in PDP. It can substantially reduce the amount of calculation for 
each hill-climbing step in networks with many connections. We have 
the impression that many people in the connectionist community do 
not understand that this is merely a particular way to compute a 
gradient and have assumed instead that back-propagation is a new 
learning scheme that somehow gets around the basic limitations of 
hill-climbing. 
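
That the chain-rule computation yields the same gradient as the weight-by-weight estimate can be checked on a tiny network. The Python sketch below assumes a network with two inputs, two logistic hidden units, and one logistic output, with nine weights stored in a flat list; this layout and all function names are hypothetical.

```python
import math

def logistic(s):
    return 1.0 / (1.0 + math.exp(-s))

def forward(W, x):
    """W[3j..3j+2] feed hidden unit j (two weights and a bias); W[6..8] feed the output."""
    h = [logistic(W[3 * j] * x[0] + W[3 * j + 1] * x[1] + W[3 * j + 2]) for j in range(2)]
    y = logistic(W[6] * h[0] + W[7] * h[1] + W[8])
    return h, y

def altitude(W, patterns):
    return -sum((t - forward(W, x)[1]) ** 2 for x, t in patterns)

def gradient_by_differences(W, patterns, eps=1e-6):
    """Two full passes through the patterns for every single weight."""
    grad = []
    for i in range(len(W)):
        up = W[:i] + [W[i] + eps] + W[i + 1:]
        down = W[:i] + [W[i] - eps] + W[i + 1:]
        grad.append((altitude(up, patterns) - altitude(down, patterns)) / (2 * eps))
    return grad

def gradient_by_chain_rule(W, patterns):
    """One forward and one backward sweep per pattern gives every component."""
    grad = [0.0] * 9
    for x, t in patterns:
        h, y = forward(W, x)
        d_out = 2 * (t - y) * y * (1 - y)   # dE/d(output unit's input sum)
        grad[6] += d_out * h[0]
        grad[7] += d_out * h[1]
        grad[8] += d_out
        for j in range(2):
            d_hid = d_out * W[6 + j] * h[j] * (1 - h[j])  # propagated backward
            grad[3 * j] += d_hid * x[0]
            grad[3 * j + 1] += d_hid * x[1]
            grad[3 * j + 2] += d_hid
    return grad

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
W0 = [0.5, -0.3, 0.1, -0.4, 0.8, -0.2, 0.7, -0.6, 0.05]
slow = gradient_by_differences(W0, XOR)
fast = gradient_by_chain_rule(W0, XOR)
```

The two lists agree to within numerical error. The recursive computation changes the cost of finding the uphill direction, not the hill being climbed.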

Clearly GD would be far more valuable than PC if it could be made 
to be both efficient and dependable. But virtually nothing has been 
proved about the range of problems upon which GD works both 
efficiently and dependably. Indeed, GD can fail to find a solution 
when one exists, so in that narrow sense it could be considered less 
powerful than PC. 

In the early years of cybernetics, everyone understood that hill- 
climbing was always available for working easy problems, but that 
it almost always became impractical for problems of larger sizes
and complexities. We were very pleased to discover (see section 
11.6) that PC could be represented as hill-climbing; however, that 
very fact led us to wonder whether such procedures could depend- 
ably be generalized, even to the limited class of multilayer ma- 
chines that we named Gamba perceptrons. The situation seems not 
to have changed much — we have seen no contemporary connec- 
tionist publication that casts much new theoretical light on the 
situation. Then why has GD become so popular in recent years? In 
part this is because it is so widely applicable, and because it does 
indeed yield new results (at least on problems of rather small 
scale). Its reputation also gains, we think, from its being presented 
in forms that share, albeit to a lesser degree, the biological plausi- 
bility of PC. But we fear that its reputation also stems from 
unfamiliarity with the manner in which hill-climbing methods dete- 
riorate when confronted with larger-scale problems. 

In any case, little good can come from statements like “as a practi- 
cal matter, GD leads to solutions in virtually every case” or “GD 
can, in principle, learn arbitrary functions.” Such pronouncements 
are not merely technically wrong; more significantly, the pretense 
that problems do not exist can deflect us from valuable insights that 
could come from examining things more carefully. As the field of 
connectionism becomes more mature, the quest for a general solution 
to all learning problems will evolve into an understanding of which 
types of learning processes are likely to work on which classes 
of problems. And this means that, past a certain point, we won’t be 
able to get by with vacuous generalities about hill-climbing. We 
will really need to know a great deal more about the nature of those 
surfaces for each specific realm of problems that we want to solve. 

On the positive side, we applaud those who bravely and roman- 
tically are empirically applying hill-climbing methods to many new 
domains for the first time, and we expect such work to result in 
important advances. Certainly these researchers are exploring net- 
works with architectures far more complex than those of percep- 
trons, and some of their experiments already have shown indica- 
tions of new phenomena that are well worth trying to understand. 

Scaling Problems Up in Size 

Experiments with toy-scale problems have proved as fruitful in 
artificial intelligence as in other areas of science and engineering.
Many techniques and principles that ultimately found real applica- 
tions were discovered and honed in microworlds small enough to 
comprehend yet rich enough to challenge our thinking. But not 
every phenomenon encountered in dealing with small models can 
be usefully scaled up. Looking at the relative thickness of the legs 
of an ant and an elephant reminds us that physical structures do not 
always scale linearly: an ant magnified a thousand times would 
collapse under its own weight. Much of the theory of computa- 
tional complexity is concerned with questions of scale. If it takes 
100 steps to solve a certain kind of equation with four terms, how 
many steps will it take to solve the same kind of equation with eight 
terms? Only 200, if the problem scales linearly. But for other prob- 
lems it will take not twice 100 but 100 squared. 

For example, the Gamba perceptron of figure 2 needs only two φ
functions rather than the six required in figure 1. In neither of these 
two toy-sized networks does the number seem alarmingly large. 
One network has fewer units; the other has smaller coefficients. 
But when we examine how those numbers grow with retinas of 
increasing size, we discover that whereas the coefficients of figure 
1 remain constant, those of figure 2 grow exponentially. And, pre- 
sumably, a similar price must be paid again in the number of repeti- 
tions required in order to learn. 

In the examination of theories of learning and problem solving, the 
study of such growths in cost is not merely one more aspect to be 
taken into account; in a sense, it is the only aspect worth consider- 
ing. This is because so many problems can be solved “in principle” 
by exhaustive search through a suitable space of states. Of course, 
the trouble with that in practice is that there is usually an exponen- 
tial increase in the number of steps required for an exhaustive 
search when the scale of the problem is enlarged. Consequently, 
solving toy problems by methods related to exhaustive search 
rarely leads to practical solutions to larger problems. For example, 
though it is easy to make an exhaustive-search machine that never 
loses a game of noughts and crosses, it is infeasible to do the same 
for chess. We do not know if this fact is significant, but many of the 
small examples described in PDP could have been solved as
quickly by means of exhaustive search — that is, by systematically 
assigning and testing all combinations of small integer weights. 



When we started our research on perceptrons, we had seen many 
interesting demonstrations of perceptrons solving problems of 
very small scale but not doing so well when those problems were 
scaled up. We wondered what was going wrong. Our first “handle” 
on how to think about scaling came with the concept of the order of 
a predicate. If a problem is of order N, then the number of φs for
the corresponding perceptron need not increase any faster than as
the Nth power of R. Then, whenever we could show that a given
problem was of low order, we usually could demonstrate that per- 
ceptron-like networks could do surprisingly well on that problem. 
On the other hand, once we developed the more difficult tech- 
niques for showing that certain other problems have unbounded 
order, this raised alarming warning flags about extending their so- 
lutions to larger domains. 
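
The contrast between bounded and unbounded order can be put in numbers. The Python sketch below assumes that the φs are masks, one per subset of at most N retinal points; that choice of family, and the name `masks_up_to_order`, are illustrative assumptions.

```python
from math import comb

def masks_up_to_order(R, N):
    """Number of subsets of an R-point retina with at most N points."""
    return sum(comb(R, k) for k in range(N + 1))

six_point = masks_up_to_order(6, 2)       # 1 + 6 + 15 = 22
hundred_point = masks_up_to_order(100, 2)  # 1 + 100 + 4950 = 5051
polynomial_bound = (100 + 1) ** 2          # grows like the Nth power of R
unbounded = masks_up_to_order(20, 20)      # every subset: 2 ** 20
```

For fixed order 2 the count on a 100-point retina is a few thousand, while allowing the order to grow with the retina gives 2^R, already more than a million at R = 20.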

Unbounded order was not the only source of scaling failures. An- 
other source — one we had not anticipated until the later stages of 
our work — involved the size, or rather the information content, of 
the coefficients. The information stored in connectionist systems is 
embodied in the strengths of weights of the connections between 
units. The idea that learning can take place by changing such 
strengths has a ring of biological plausibility, but that plausibility 
fades away if those strengths are to be represented by numbers that 
must be accurate to ten or twenty decimal orders of significance. 

The Problem of Sampling Variance 

Our description of the Generalized Delta Rule assumes that it is 
feasible to compute the new value of E( W) at every step of the 
climb. The processes discussed in chapter 8 of PDP typically re- 
quire only on the order of 100,000 iterations, a range that is easily 
accessible to computers (but that might in some cases strain our 
sense of biological plausibility). However, it will not be practical, 
with larger problems, to cycle through all possible input patterns. 
This means that when precise measures of E( W) are unavailable, 
we will be forced to act, instead, on the basis of incomplete sam- 
ples — for example, by making a small hill-climbing step after each 
reaction to a stimulus. (See the discussion of complete versus in- 
cremental methods in subsection 12.1.1.) When we can no longer 
compute dE/dW precisely but can only estimate its components, 
then the actual derivative will be masked by a certain amount of
sampling noise. The text of PDP argues that using sufficiently small 
steps can force the resulting trajectory to come arbitrarily close to 
that which would result from knowing dE/dW precisely. When we
tried to prove this, we were led to suspect that the choice of step 
size may depend so much on the higher derivatives of the smooth- 
ing functions that large-scale problems could require too many 
steps for such methods to be practical. 

So far as we could tell, every experiment described in chapter 8 of 
PDP involved making a complete cycle through all possible input 
situations before making any change in weights. Whenever this is 
feasible, it completely eliminates sampling noise — and then even 
the most minute correlations can become reliably detectable, be- 
cause the variance is zero. But no person or animal ever faces 
situations that are so simple and arranged in so orderly a manner as 
to provide such cycles of teaching examples. Moving from small to 
large problems will often demand this transition from exhaustive to 
statistical sampling, and we suspect that in many realistic situa- 
tions the resulting sampling noise would mask the signal com- 
pletely. We suspect that many who read the connectionist litera- 
ture are not aware of this phenomenon, which dims some of the 
prospects of successfully applying certain learning procedures to 
large-scale problems. 

Problems of Scaling 

In principle, connectionist networks offer all the potential of uni- 
versal computing devices. However, our examples of order and 
coefficient size suggest that various kinds of scaling problems are 
likely to become obstacles to attempts to exploit that potential. 
Fortunately, our analysis of perceptrons does not suggest that con- 
nectionist networks need always encounter these obstacles. In- 
deed, our book is rich in surprising examples of tasks that simple 
perceptrons can perform using relatively low-order units and small 
coefficients. However, our analysis does show that parallel net- 
works are, in general, subject to serious scaling phenomena. Con- 
sequently, researchers who propose such models must show that, 
in their context, those phenomena do not occur. 

The authors of PDP seem disinclined to face such problems. They 
seem content to argue that, although we showed that single-layer 
networks cannot solve certain problems, we did not know that 
there could exist a powerful learning procedure for multilayer net- 
works — to which our theorems no longer apply. However, strictly 
speaking, it is wrong to formulate our findings in terms of what 
perceptrons can and cannot do. As we pointed out above, percep- 
trons of sufficiently large order can represent any finite predicate. 
A better description of what we did is that, in certain cases, we 
established the computational costs of what perceptrons can do as 
a function of increasing problem size. The authors of PDP show 
little concern for such issues, and usually seem content with exper- 
iments in which small multilayer networks solve particular in- 
stances of small problems. 

What should one conclude from such examples? A person who 
thinks in terms of can versus can't will be tempted to suppose that 
if toy machines can do something, then larger machines may well 
do it better. One must always probe into the practicality of a proposed 
learning algorithm. It is no use to say that “procedure P is capable of 
learning to recognize pattern X” unless one can show 
that this can be done in less time and at less cost than with exhaus- 
tive search. Thus, as we noted, in the case of symmetry, the authors of 
PDP actually recognized that the coefficients were growing as 
powers of 2, yet they did not seem to regard this as suggesting that 
the experiment worked only because of its very small size. But 
scientists who exploit the insights gained from studying the single- 
layer case might draw quite different conclusions. 
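
To make the cost of coefficients that grow as powers of 2 concrete, here is a minimal sketch that counts the decimal digits needed to write the largest coefficient exactly. It assumes, purely for illustration, that the coefficients run 1, 2, 4, ..., 2^n for a problem of size n; the function name is hypothetical.

```python
def decimal_digits_needed(n):
    """Decimal digits needed to write 2**n exactly.

    Assumes (for illustration only) that the largest coefficient is 2**n.
    """
    return len(str(2 ** n))

for n in (10, 30, 64):
    print(n, decimal_digits_needed(n))
```

Under that assumption, a problem of size 30 already calls for ten significant decimal digits and one of size 64 for twenty.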

The authors of PDP recognize that GD is a form of hill-climber, but 
they speak as though becoming trapped on local maxima were 
rarely a serious problem. In reporting their experiments with learn- 
ing the XOR predicate, they remark that this occurred “in only two 
cases ... in hundreds of times.” However, that experiment in- 
volved only the toy problem of learning to compute the XOR of two 
arguments. We conjecture that learning XOR will become increasingly 
intractable as we increase the number of input variables, 
because by its nature the underlying 
parity function is absolutely uncorrelated with any function of 
fewer variables. Therefore, there can exist no useful correlations 
among the outputs of the lower-order units involved in computing 
it, and that leads us to suspect that there is little to gain from 
following whatever paths are indicated by the artificial introduc- 
tion of smoothing functions that cause partial derivatives to exist. 
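
The claim that parity is uncorrelated with every function of fewer variables can be checked exhaustively for small cases. The sketch below uses a +1/-1 encoding (an assumption of this example) and enumerates every Boolean function that ignores at least one of the n inputs; it is practical only for small n.

```python
from itertools import combinations, product

def parity_pm(bits):
    """Parity in +1/-1 form: +1 for an even number of ones, -1 for an odd number."""
    return -1 if sum(bits) % 2 else 1

def max_abs_correlation(n):
    """Largest |correlation| between n-bit parity and any Boolean function
    that depends on only n - 1 of the inputs (exhaustive search)."""
    inputs = list(product((0, 1), repeat=n))
    sub_inputs = list(product((0, 1), repeat=n - 1))
    worst = 0.0
    for kept in combinations(range(n), n - 1):
        # Every truth table over the kept variables, with outputs in {-1, +1}.
        for table in product((-1, 1), repeat=len(sub_inputs)):
            g = dict(zip(sub_inputs, table))
            total = sum(parity_pm(x) * g[tuple(x[i] for i in kept)]
                        for x in inputs)
            worst = max(worst, abs(total) / len(inputs))
    return worst

print(max_abs_correlation(3))
print(max_abs_correlation(4))
```

The result is zero in every case because flipping the ignored input reverses the parity while leaving the other function unchanged, so the two contributions cancel.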



The PDP experimenters encountered a more serious local-maxi- 
mum problem when trying to make a network learn to add two 
binary numbers — a problem that contains an embedded XOR prob- 
lem. When working with certain small networks, the system got 
stuck reliably. However, the experimenters discovered an inter- 
esting way to get around this difficulty by introducing longer chains 
of intermediate units. We encourage the reader to study the discus- 
sion starting on page 341 of PDP and try to make a more complete 
theoretical analysis of this problem. We suspect that further study 
of this case will show that hill-climbing procedures can indeed get 
multilayer networks to learn to do multidigit addition. However, 
such a study should be carried out not to show that “networks are 
good” but to see which network architectures are most suitable for 
enabling the information required for “carrying” to flow easily 
from the smaller to the larger digits. In the PDP experiment, the 
network appears to us to have started on the road toward inventing 
the technique known to computer engineers as “carry jumping.” 

To what extent can hill-climbing systems be made to solve hard 
problems? One might object that this is a wrong question because 
“hard” is so ill defined. The lesson of Perceptrons is that we must 
find ways to make such questions meaningful. In the case of hill- 
climbing, we need to find ways to characterize the types of prob- 
lems that lead to the various obstacles to climbing hills, instead of 
ignoring those difficulties or trying to find universal ways to get 
around them. 

The Society of Mind 

The preceding section was written as though it ought to be the 
principal goal of research on network models to determine in which 
situations it will be feasible to scale their operations up to deal with 
increasingly complicated problems. But now we propose a some- 
what shocking alternative: Perhaps the scale of the toy problem is 
that on which, in physiological actuality, much of the functioning 
of intelligence operates. Accepting this thesis leads into a way of 
thinking very different from that of the connectionist movement. 
We have used the phrase “society of mind” to refer to the idea that 
mind is made up of a large number of components, or “agents,” 
each of which would operate on the scale of what, if taken in 
isolation, would be little more than a toy problem. [See Marvin 
Minsky, The Society of Mind (Simon and Schuster, 1987) and Sey- 
mour Papert, Mindstorms (Basic Books, 1982).] 

To illustrate this idea, let’s try to compare the performance of the 
symmetry perceptron in PDP with human behavior. An adult human 
can usually recognize and appreciate the symmetries of a 
kaleidoscope, and that sort of example leads one to imagine that 
people do very much better than simple perceptrons. But how 
much can people actually do? Most people would be hard put to be 
certain about the symmetry of a large pattern. For example, how 
long does it take you to decide whether or not the following pattern 
is symmetrical? 

DB4HWUK85HCNZEWJKRKJWEZNCH58KUWH4BD 
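
For comparison, a serial procedure settles the question with one pass of pairwise comparisons from the two ends inward. This is only a sketch; the function name is an illustrative choice.

```python
def first_asymmetry(pattern):
    """Return the index of the first character that fails to match its mirror
    partner, or None if the pattern reads the same in both directions."""
    left, right = 0, len(pattern) - 1
    while left < right:
        if pattern[left] != pattern[right]:
            return left
        left += 1
        right -= 1
    return None

print(first_asymmetry("DB4HWUK85HCNZEWJKRKJWEZNCH58KUWH4BD"))
```

The procedure needs about half as many comparisons as there are characters, performed one after another, which is roughly what a person must do by eye.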

In many situations, humans clearly show abilities far in excess of 
what could be learned by simple, uniform networks. But when we 
take those skills apart, or try to find out how they were learned, we 
expect to find that they were made by processes that somehow 
combined the work (already done in the past) of many smaller 
agencies, none of which, separately, needs to work on scales much 
larger than do those in PDP. Is this hypothesis consistent with the 
PDP style of connectionism? Yes, insofar as the computations of 
the nervous system can be represented as the operation of societies 
of networks. But no, insofar as the mode of operation of those 
societies of networks (as we imagine them) raises theoretical issues 
of a different kind. We do not expect procedures such as GD to be 
able to produce such societies. Something else is needed. 

What that something must be depends on how we try to extend the 
range of small connectionist models. We see two principal alterna- 
tives. We could extend them either by scaling up small connection- 
ist models or by combining small-scale networks into some larger 
organization. In the first case, we would expect to encounter theo- 
retical obstacles to maintaining GD’s effectiveness on larger, 
deeper nets. And despite the reputed efficacy of other alleged remedies 
for the deficiencies of hill-climbing, such as “annealing,” we 
stay with our research conjecture that no such procedures will 
work very well on large-scale nets, except in the case of problems 
that turn out to be of low order in some appropriate sense. The 
second alternative is to employ a variety of smaller networks rather 
than try to scale up a single one. And if we choose (as we do) to 
move in that direction, then our focus of concern as theoretical 
psychologists must turn toward the organizing of small nets into 
effective large systems. The idea that the lowest levels of thinking 
and learning may operate on toy-like scales fits many of our com- 
mon-sense impressions of psychology. For example, in the realm 
of language, any normal person can parse a great many kinds of 
sentences, but none of them past a certain bound of involuted 
complexity. We all fall down on expressions like “the cheese that 
the rat that the cat that the dog bit chased ate.” In the realm of 
vision, no one can count great numbers of things, in parallel, at a 
single glance. Instead, we learn to “estimate.” Indeed, the visual 
joke in figure 0.1 shows clearly how humans share perceptrons’ 
inability to easily count and match, and a similar example is em- 
bodied in the twin spirals of figure 5.1. The spiral example was 
intended to emphasize not only that low-order perceptrons cannot 
perceive connectedness but also that humans have similar limita- 
tions. However, a determined person can solve the problem, given 
enough time, by switching to the use of certain sorts of serial men- 
tal processes. 

Beyond Perceptrons 

No single-method learning scheme can operate efficiently for every 
possible task; we cannot expect any one type of machine to ac- 
count for any large portion of human psychology. For example, in 
certain situations it is best to carefully accumulate experience; 
however, when time is limited, it is necessary to make hasty 
generalizations and act accordingly. No single scheme can do all 
things. Our human semblance of intelligence emerged from how 
the brain evolved a multiplicity of ways to deal with different prob- 
lem realms. We see this as a principle that underlies the mind’s 
reality, and we interpret the need for many kinds of mechanisms 
not as a pessimistic and scientifically constraining limitation but as 
the fundamental source of many of the phenomena that artificial 
intelligence and psychology have always sought to understand. 
The power of the brain stems not from any single, fixed, universal 
principle. Instead it comes from the evolution (in both the individ- 
ual sense and the Darwinian sense) of a variety of ways to develop 
new mechanisms and to adapt older ones to perform new functions. 
Instead of seeking a way to get around that need for diversity, we 
have come to try to develop “society of mind” theories that will 
recognize and exploit the idea that brains are based on many differ- 
ent kinds of interacting mechanisms. 

Several kinds of evidence impel us toward this view. One is the 
great variety of different and specific functions embodied in the 
brain’s biology. Another is the similarly great variety of phenom- 
ena in the psychology of intelligence. And from a much more ab- 
stract viewpoint, we cannot help but be impressed with the practi- 
cal limitations of each “general” scheme that has been proposed — 
and with the theoretical opacity of questions about how they be- 
have when we try to scale their applications past the toy problems 
for which they were first conceived. 

Our research on perceptrons and on other computational schemes 
has left us with a pervasive bias against seeking a general, domain- 
independent theory of “how neural networks work.” Instead, we 
ought to look for ways in which particular types of network models 
can support the development of models of particular domains of 
mental function — and vice versa. Thus, our understanding of the 
perceptron’s ability to perform geometric tasks was actually based 
on theories that were more concerned with geometry than with 
networks. And this example is supported by a broad body of expe- 
rience in other areas of artificial intelligence. Perhaps this is why 
the current preoccupation of connectionist theorists with the 
search for general learning algorithms evokes for us two aspects of 
the early history of computation. 

First, we are reminded of the long line of theoretical work that 
culminated in the “pessimistic” theories of Gödel and Turing 
about the limitations on effective computability. Yet the realization 
that there can be no general-purpose decision procedure for mathe- 
matics had not the slightest dampening effect on research in mathe- 
matics or in computer science. On the contrary, awareness of those 
limiting discoveries helped motivate the growth of rich cultures 
involved with classifying and understanding more specialized 
algorithmic methods. In other words, researchers came to realize that 
seeking overgeneral solution methods would be as fruitless as (and 
equivalent to) trying to solve the unsolvable halting problem for 
Turing machines. Abandoning that search then led to seeking progress in 
more productive directions. 



Our second thought is about how the early research in artificial 
intelligence tended to focus on general-purpose algorithms for rea- 
soning and problem solving. Those general methods will always 
play their roles, but the most successful applications of AI research 
gained much of their practical power from applying specific knowl- 
edge to specific domains. Perhaps that work has now moved too far 
toward ignoring general theoretical considerations, but by now we 
have learned to be skeptical about the practical power of unre- 
strained generality. 

Interaction and Insulation 

Evolution seems to have anticipated these discoveries. Although 
the nervous system appears to be a network, it is very far from 
being a single, uniform, highly interconnected assembly of units 
that each have similar relationships to the others. Nor are all brain 
cells similarly affected by the same processes. It would be better to 
think of the brain not as a single network whose elements operate 
in accord with a uniform set of principles but as a network whose 
components are themselves networks having a large variety of dif- 
ferent architectures and control systems. This “society of mind” 
idea has led our research perspective away from the search for 
algorithms, such as GD, that were hoped to work across many 
domains. Instead, we were led into trying to understand what 
specific kinds of processing would serve specific domains. 

We recognize that the idea of distributed, cooperative processing 
has a powerful appeal to common sense as well as to computational 
and biological science. Our research instincts tell us to discover as 
much as we can about distributed processes. But there is another 
concept, complementary to distribution, that is no less strongly 
supported by the same sources of intuition. We’ll call it insulation. 

Certain parallel computations are by their nature synergistic and 
cooperative: each part makes the others easier. But the And/Or of 
theorem 4.0 shows that under other circumstances, attempting to 
make the same network perform two simple tasks at the same time 
leads to a task that has a far greater order of difficulty. In those 
sorts of circumstances, there will be a clear advantage to having 
mechanisms, not to connect things together, but to keep such tasks 
apart. How can this be done in a connectionist net? Some recent 
work hints that even simple multilayer perceptron-like nets can 
learn to segregate themselves into quasi-separate components — 
and that suggests (at least in principle) research on uniform learn- 
ing procedures. But it also raises the question of how to relate 
those almost separate parts. In fact, research on networks in which 
different parts do different things and learn those things in different 
ways has become our principal concern. And that leads us to ask 
how such systems could develop managers for deciding, in differ- 
ent circumstances, which of those diverse procedures to use. 

For example, consider all the specialized agencies that the human 
brain employs to deal with the visual perception of spatial scenes. 
Although we still know little about how all those different agencies 
work, the end result is surely even more complex than what we 
described in section 13.4. Beyond that, human scene analysis also 
engages our memories and goals. Furthermore, in addition to all 
the systems we humans use to dissect two-dimensional scenes into 
objects and relationships, we also possess machinery for exploiting 
stereoscopic vision. Indeed, there appear to be many such agen- 
cies — distinct ones that employ, for example, motion cues, dis- 
parities, central correlations of the Julesz type, and memory-based 
frame-array-like systems that enable us to imagine and virtually 
“see” the occluded sides of familiar objects. Beyond those, we 
seem also to have been supplied with many other visual agencies — 
for example, ones that are destined to learn to recognize faces and 
expressions, visual cliffs, threatening movements, sexual attrac- 
tants, and who knows how many others that have not been discov- 
ered yet. What mechanisms manage and control the use of all those 
diverse agencies? And from where do those managers come? 

Stages of Development 

In Mindstorms and in The Society of Mind, we explained how the 
idea of intermediate, hidden processes might well account for some 
phenomena discovered by Piaget in his experiments on how chil- 
dren develop their concepts about the “conservation of quantity.” 
We introduced a theory of mental growth based on inserting, at 
various times, new inner layers of “management” into already 
existing networks. In particular, we argued that, to learn to make 
certain types of comparisons, a child’s mind must construct a mul- 
tilayer structure that we call a “society-of-more.” The lower levels 
of that net contain agents specialized to make a variety of spatial 
and temporal observations. Then the higher-level agents learn to 
classify, and then control, the activities of the lower ones. We 
certainly would like to see a demonstration of a learning process 
that could spontaneously produce the several levels of agents 
needed to embody a concept as complex as that. Chapter 17 of The 
Society of Mind offers several different reasons why this might be 
very difficult to do except in systems under systematic controls, 
both temporal and architectural. We suspect that it would require 
far too long, in comparison with an infant’s months of life, to create 
sophisticated agencies entirely by undirected, spontaneous learn- 
ing. Each specialized network must begin with promising ingre- 
dients that come either from prior stages of development or from 
some structural endowment that emerged in the course of organic 
evolution. 

When should new layers of control be introduced? If managers are 
empowered too soon, when their workers still are too immature, 
they won’t be able to accomplish enough. (If every agent could 
learn from birth, they would all be overwhelmed by infantile ideas.) 
But if the managers arrive too late, that will retard all further 
growth. Ideally, every agency’s development would be controlled 
by yet another agency equipped to introduce new agents just when 
they are needed — that is, when enough has been learned to justify 
the start of another stage. However, that would require a good deal 
of expertise on the controlling agency’s part. Another way — much 
easier to evolve — would simply enable various agencies to estab- 
lish new connections at genetically predetermined times (perhaps 
while also causing lower-level parts to slow further growth). Such a 
scheme could benefit a human population on the whole, although 
it might handicap individuals who, for one reason or another, hap- 
pen to move ahead of or behind that inborn “schedule.” In any 
case, there are many reasons to suspect that the parts of any sys- 
tem as complex as a human mind must grow through sequences of 
stage-like episodes. 

Architecture and Specialization 

The tradition of connectionism has always tried to establish two 
claims: that connectionist networks can accomplish interesting 
tasks and that they can learn to do so with no explicit programming. 
But a closer look reveals that rarely are those two virtues 
present in the same device. It is true that networks, taken as a 
class, can do virtually anything. However, each particular type of 
network can best learn only certain types of things. Each particular 
network we have seen seems relatively limited. Yet our wondrous 
brains are themselves composed of connected networks of cells. 

We think that the difference in abilities comes from the fact that a 
brain is not a single, uniformly structured network. Instead, each 
brain contains hundreds of different types of machines, intercon- 
nected in specific ways which predestine that brain to become a 
large, diverse society of partially specialized agencies. We are born 
with specific parts of our brains to serve every sense and muscle 
group, and with perhaps separate sections for physical and social 
matters (e.g., natural sounds versus social speech, inanimate 
scenes versus facial expressions, mechanical contacts versus so- 
cial caresses). Our brains also embody proto-specialists involved 
with hunger, laughter, anger, fear, and perhaps hundreds of other 
functions that scientists have not yet isolated. Many thousands of 
genes must be involved in constructing specific internal architec- 
tures for each of those highly evolved brain centers and in laying 
out the nerve bundles that interconnect them. And although each 
such system is embodied in the form of a network-based learning 
system, each almost surely also learns in accord with somewhat 
different principles. 

Why did our brains evolve so as to contain so many specialized 
parts? Could not a single, uniform network learn to structure itself 
into divisions with appropriate architectures and processes? We 
think that this would be impractical because of the problem of repre- 
senting knowledge. In order for a machine to learn to recognize or 
perform X, be it a pattern or a process, that machine must in one 
sense or another learn to represent or embody X. Doing that 
efficiently must exploit some happy triadic relationship between 
the structure of X, the learning procedure, and the initial architecture 
of the network. It makes no sense to seek the “best” network 
architecture or learning procedure because it makes no sense to 
say that any network is efficient by itself: that makes sense only in 
the context of some class of problems to be solved. Different kinds 
of networks lend themselves best to different kinds of representa- 
tions and to different sorts of generalizations. This means that the 
study of networks in general must include attempts, like those in 
this book, to classify problems and learning processes; but it must 
also include attempts to classify the network architectures. This is 
why we maintain that the scientific future of connectionism is tied 
not to the search for some single, universal scheme to solve all 
problems at once but to the evolution of a many-faceted technology 
of “brain design” that encompasses good technical theories about 
the analysis of learning procedures, of useful architectures, and of 
organizational principles to use when assembling those compo- 
nents into larger systems. 

Symbolic versus Distributed 

Let us now return to the conflict posed in our prologue: the war 
between the connectionists and the symbolists. We hope to make 
peace by exploiting both sides. 

There are important virtues in the use of parallel distributed net- 
works. They certainly often offer advantages in simplicity and in 
speed. And above all else they offer us ways to learn new skills 
without the pain and suffering that might come from comprehend- 
ing how. On the darker side, they can limit large-scale growth 
because what any distributed network learns is likely to be quite 
opaque to other networks connected to it. 

Symbolic systems yield gains of their own, in versatility and un- 
limited growth. Above all else they offer us the prospect that com- 
puters share: of not being bound by the small details of the parts of 
which they are composed. But that, too, has its darker side: sym- 
bolic processes can evolve worlds of their own, utterly divorced 
from their origins. Perceptrons can never go insane — but the same 
cannot be said of a brain. 

Now, what are symbols, anyway? We usually conceive of them as 
compact things that represent more complex things. But what, 
then, do we mean by represent? It simply makes no sense, by itself, 
to say that “S represents T,” because the significance of a symbol 
depends on at least three participants: on S, on T, and on the 
context of some process or user U. What, for example, connects 
the word table to any actual, physical table? Since the words people 
use are the words people learn, clearly the answer must be that 
there is no direct relationship between S and T, but that there is a 
more complex triadic relationship that connects a symbol, a thing, 
and a process that is active in some person’s mind. Furthermore, 
when the term symbol is used in the context of network psychol- 
ogy, it usually refers to something that is reassignable so that it can 
be made to represent different things and so that the symbol-using 
processes can learn to deal with different symbols. 

What do we mean by distributed? This usually refers to a system in 
which each end-effect comes not from any single, localized ele- 
ment-part, but from the interactions of many contributors, all 
working at the same time. Accordingly, in order to make a desired 
change in the output of a distributed system, one must usually alter 
a great many components. And changing the output of any particu- 
lar component will rarely have a large effect in any particular cir- 
cumstance; instead, such changes will tend to have small effects in 
many different circumstances. 

Symbols are tokens or handles with which one specialist can ma- 
nipulate representations within another specialist. But now, sup- 
pose that we want one agency to be able to exploit the knowledge in 
another agency. So long as we stay inside a particular agency, it 
may be feasible to use representations that involve great hosts of 
internal interactions and dependencies. But the fine details of such 
a representation would be meaningless to any outside agency that 
lacks access to, or the capacity to deal with, all that fine detail. 
Indeed, if each representation in the first agency involves activities 
that are uniformly distributed over a very large network, then di- 
rect communication to the other agency would require so many 
connection paths that both agencies would end up enmeshed to- 
gether into a single, uniform net — and then all the units of both 
would interact. 

How, then, could networks support symbolic forms of activities? 
We conjecture that, inside the brain, agencies with different jobs 
are usually constrained to communicate with one another only 
through neurological bottlenecks (i.e., connections between rela- 
tively small numbers of units that are specialized to serve as sym- 
bolic recognizers and memorizers). The recognizers learn to en- 
code significant features of the representation active in the first 
network, and the memorizers learn to evoke an activity that can 
serve a corresponding function in the receiving network. But in 
order to prevent those features from interfering too much with one 
another, there must be an adequate degree of insulation between 
the units that serve these purposes. And that need for insulation 
can lead to genuine conflicts between the use of symbolic and dis- 
tributed representations. This is because distributed representa- 
tions make it hard to combine (in arbitrary, learnable ways) the 
different fragments of knowledge embodied in different representa- 
tions. The difficulty arises because the more distributed is the rep- 
resentation of each fragment, the fewer fragments can be simulta- 
neously active without interfering with one another. Sometimes 
those interactions can be useful, but in general they will be destruc- 
tive. This is discussed briefly in section 8.2 of The Society of Mind: 

“The advantages of distributed systems are not alternatives to the 
advantages of insulated systems: the two are complementary. To 
say that the brain may be composed of distributed systems is not 
the same as saying that it is a distributed system — that is, a single 
network in which all functions are uniformly distributed. We do not 
believe that any brain of that sort could work, because the interac- 
tions would be uncontrollable. To be sure, we have to explain how 
different ideas can become connected to one another — but we must 
also explain what keeps our separate memories intact. For ex- 
ample, we praised the power of metaphors that allow us to mix the 
ideas we have in different realms — but all that power would be lost 
if all our metaphors got mixed! Similarly, the architecture of a 
mind-society must encourage the formation and maintenance of 
distinct levels of management by preventing the formation of con- 
nections between agencies whose messages have no mutual 
significance. Some theorists have assumed that distributed systems 
are inherently both robust and versatile but, actually, those attri- 
butes are more likely to conflict. Systems with too many interac- 
tions of different types will tend to be fragile, while systems with 
too many interactions of similar types will tend to be too redundant 
to adapt to novel situations and requirements.” 

A larger-scale problem is that the use of widely distributed repre- 
sentations will tend to oppose the formulation of knowledge about 
knowledge. This is because information embodied in distributed 
form will tend to be relatively inaccessible for use as a subject upon 
which other knowledge-based processes can operate. Conse- 
quently (we conjecture), systems that use highly distributed repre- 
sentations will tend to become conceptual dead ends as a result of 
their putting performance so far ahead of comprehension as to 
retard the growth of reflective thought. Too much diffusing of in- 
formation can make it virtually impossible (for other portions of 
the brain) to find out how results, however useful, are obtained. 
This would make it very difficult to dissect out the components that 
might otherwise be used to construct meaningful variations and 
generalizations. Of course such problems won’t become evident in 
experiments with systems that do only simple things, but we can 
expect to see such problems grow when systems try to learn to do 
more complex things. With highly distributed systems, we should 
anticipate that the accumulation of internal interactions may even- 
tually lead to intractable credit-assignment problems. Perhaps the 
only ultimate escape from the limitations of internal interactions is 
to evolve toward organizations in which each network affects 
others primarily through the use of serial operations and specialized
short-term-memory systems, for although seriality is relatively
slow, its use makes it possible to produce and control interactions
between activities that occur at different and separate
places and times. 

The Parallel Paradox 

It is often argued that the use of distributed representations enables 
a system to exploit the advantages of parallel processing. But what 
are the advantages of parallel processing? Suppose that a certain 
task involves two unrelated parts. To deal with both concurrently, 
we would have to maintain their representations in two decoupled 
agencies, both active at the same time. Then, should either of those 
agencies become involved with two or more subtasks, we would 
have to deal with each of them with no more than a quarter of the 
available resources. If that proceeded on and on, the system would 
become so fragmented that each job would end up with virtually no 
resources assigned to it. In this regard, distribution may oppose 
parallelism: the more distributed a system is — that is, the more 
intimately its parts interact — the fewer different things it can do at 
the same time. On the other side, the more we do separately in 
parallel, the less machinery can be assigned to each element of 
what we do, and that ultimately leads to increasing fragmentation 
and incompetence. 
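
The halving argument can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that every agency divides its resources equally among its subtasks; the name share_per_task and the branching parameter are our own labels, not anything defined in the text.

```python
from fractions import Fraction


def share_per_task(depth, branching=2):
    """Fraction of the total resources left for a single task after
    `depth` rounds of splitting, assuming equal shares at every split
    (an illustrative assumption)."""
    return Fraction(1, branching ** depth)


# Depth 1: two unrelated parts. Depth 2: each part splits again, and so on.
for depth in range(5):
    print(depth, share_per_task(depth))
```

Under this assumption the share shrinks geometrically with the depth of subdivision, which is the sense in which each job ends up with virtually no resources assigned to it.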

This is not to say that distributed representations and parallel processing
are always incompatible. When we simultaneously activate
two distributed representations in the same network, they will be
forced to interact. In favorable circumstances, those interactions 
can lead to useful parallel computations, such as the satisfaction of 
simultaneous constraints. But that will not happen in general; it 
will occur only when the representations happen to mesh in suit- 
ably fortunate ways. Such problems will be especially serious 
when we try to train distributed systems to deal with problems that 
require any sort of structural analysis in which the system must 
represent relationships between substructures of related types — 
that is, problems that are likely to compete for the same limited 
resources. 

On the positive side, there are potential virtues to embodying 
knowledge in the form of networks of units with weighted intercon- 
nections. For example, distributed representations can sometimes 
be used to gain the robustness of redundancy, to make machines 
that continue to work despite having injured, damaged, or unreli- 
able components. They can embody extremely simple learning al- 
gorithms, which operate in parallel with great speed. 

Representations and Generalizations 

It is often said that distributed representations are inherently pos- 
sessed of useful holistic qualities; for example, that they have in- 
nate tendencies to recognize wholes from partial cues — even for 
patterns they have not encountered before. Phenomena of that sort 
are often described with such words as generalization, induction, 
or gestalt. Such phenomena certainly can emerge from connectionist
assemblies. The problem is that, for any body of experience,
there are always many kinds of generalizations that can be made. 
The ones made by any particular network are likely to be inappro- 
priate unless there happens to be an appropriate relationship be- 
tween the network’s architecture and the manner in which the 
problem is represented. What makes architectures and representa- 
tions appropriate? One way to answer that is to study how they 
affect which signals will be treated as similar. 

Consider the problem of comparing an arbitrary input pattern with 
a collection of patterns in memory, to find which memory is most 
similar to that stimulus. In section 12.7 we conjectured that solving 
best-match problems will always be very tedious when serial hardware
is used. PDP suggests another view in regard to parallel,
distributed machines: “This is precisely the kind of problem that is 
readily implemented using highly parallel algorithms of the kind we 
consider.” This is, in some ways, plausible, since a sufficiently 
parallel machine could simultaneously match an input pattern 
against every pattern in its memory. And yet the assertion is 
quaintly naive, since best match means different things in different 
circumstances. Which answers should be accepted as best always 
depends on the domain of application. The very same stimulus may 
signify food to one animal, companionship to another, and a 
dangerous predator to a third. Thus, there can be no single, univer- 
sal measure of how well two descriptions match; every context 
requires appropriate schemes. Because of this, distributed net- 
works do not magically provide solutions to such best-match prob- 
lems. Instead, the functional architecture of each particular net- 
work imposes its own particular sort of metrical structure on the 
space of stimuli. Such structures may often be useful. Yet, that can 
give us no assurance that the outcome will correspond to what an 
expert observer would consider to be the very best match, given 
that observer’s view of what would be the most appropriate re- 
sponse in the current context or problem realm. 
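
A small sketch may help show why "best match" depends on the measure of similarity. It assumes binary feature vectors and a weighted count of mismatched features; the stored patterns, their labels, the weights, and the function names are all hypothetical and chosen only for illustration.

```python
def weighted_mismatch(x, y, weights):
    """Sum of the weights of the positions where x and y differ."""
    return sum(w for xi, yi, w in zip(x, y, weights) if xi != yi)


def best_match(stimulus, memory, weights):
    """Name of the stored pattern closest to the stimulus under `weights`."""
    return min(memory,
               key=lambda name: weighted_mismatch(stimulus, memory[name], weights))


# Hypothetical memory of two stored patterns and one input stimulus.
MEMORY = {"food": (1, 1, 1, 1), "predator": (0, 1, 0, 0)}
STIMULUS = (1, 1, 0, 0)

uniform = (1, 1, 1, 1)         # every feature counts equally
first_feature = (5, 1, 1, 1)   # a context in which feature 0 matters most

print(best_match(STIMULUS, MEMORY, uniform))
print(best_match(STIMULUS, MEMORY, first_feature))
```

The same stimulus and the same memory give different answers under the two weightings, so the choice of metric, not the parallel hardware, decides which match counts as best.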

We certainly do not mean to suggest that networks cannot perform 
useful matching functions. We merely mean to emphasize that dif- 
ferent problems entail different matching criteria, and that hence 
no particular type of network can induce a topology of similarity or 
nearness that is appropriate for every realm. Instead, we must 
assume that, over the course of time, each specialized portion of 
the brain has evolved a particular type of architecture that is rea- 
sonably likely to induce similarity relationships that are useful in 
performing the functions to which that organ is likely (or destined) 
to be assigned. Perhaps an important activity of future connection- 
ist research will be to develop networks that can learn to embody 
wide ranges of different, context-dependent types of matching 
functions. 

We have also often heard the view that machines that employ lo- 
calized or symbolic representations must be inherently less capa- 
ble than are distributed machines of insight, consciousness, or 
sense of self. We think this stands things on their heads. It is 
because our brains primarily exploit connectionist schemes that
we possess such small degrees of consciousness, in the sense that
we have so little insight into the nature of our own conceptual 
machinery. We agree that distributed representations probably are 
used in virtually every part of the brain. Consequently, each 
agency must learn to exploit the abilities of the others without 
having direct access to compact representations of what happens 
inside those other agencies. This makes direct insight infeasible; 
the best such agencies can do is attempt to construct their own 
models of the others on the basis of approximate, pragmatic mod- 
els based on presuppositions and concepts already embodied in the 
observing agency. Because of this, what appear to us to be direct 
insights into ourselves must be rarely genuine and usually conjec- 
tural. Accordingly, we expect distributed representations to tend 
to produce systems with only limited abilities to reflect accurately 
on how they do what they do. Thinking about thinking, we main- 
tain, requires the use of representations that are localized enough 
that they can be dissected and rearranged. Besides, distributed 
representations spread out the information that goes into them. 
The result of this is to mix and obscure the effects of their separate 
elements. Thus their use must entail a heavy price; surely, many of 
them must become “conceptual dead ends” because the perfor- 
mances that they produce emerge from processes that other agen- 
cies cannot comprehend. In other words, when the representations 
of concepts are distributed, this will tend to frustrate attempts of 
other agencies to adapt and transfer those concepts to other con- 
texts. 

How much, then, can we expect from connectionist systems? 
Much more than the above remarks might suggest, since reflective 
thought is the lesser part of what our minds do. Most probably, we 
think, the human brain is, in the main, composed of large numbers 
of relatively small distributed systems, arranged by embryology 
into a complex society that is controlled in part (but only in part) by 
serial, symbolic systems that are added later. But the subsymbolic 
systems that do most of the work from underneath must, by their 
very character, block all the other parts of the brain from knowing 
much about how they work. And this, itself, could help explain 
how people do so many things yet have such incomplete ideas of 
how those things are actually done. 



Bibliographic Notes 


The following remarks are intended to introduce the literature 
of this field. This is not to be considered an attempt at historical 
scholarship, for we have made no serious venture in that direc- 
tion. 

In a decade of work on the family of machines loosely called 
perceptrons, we find an interacting evolution and refinement of 
two ideas: first, the concept of realizing a predicate as a linear 
threshold function of much more local predicates; second, the 
idea of a convergence or “learning” theorem. The most com- 
monly held version of this history sees the perceptron invented 
by Rosenblatt in a single act, with the final proof of the con- 
vergence theorem vindicating his insight in the face of skepticism 
from the scientific world. This is an oversimplification, especially 
in its taking the concept of perceptron as static. For in fact 
a key part of the process leading to the convergence theorem was 
the molding of the concept of the machine to the appropriate 
form. (Indeed, how often does “finding the proof” of a conjec- 
ture involve giving the conjecture a more provable form?) 

In the early papers one sees a variety, both of machines and of 
“training” procedures, converging in the course of accumulation 
of mathematical insight toward the simple concepts we have used 
in this book. Students interested in this evolution can read: 

Rosenblatt, Frank (1959), “Two theorems of statistical separability in
the perceptron,” Proceedings of a Symposium on the Mechanization of
Thought Processes, Her Majesty’s Stationery Office, London, pp. 421-456;

Rosenblatt, Frank (1962), Principles of Neurodynamics, Spartan Books,
New York.

In a variety of contexts, other perceptronlike learning experiments 
had been described. Quite well-known was the paper of 

Samuel, Arthur L. (1959), “Some studies in machine learning using the 
game of checkers,” IBM Journal of Research and Development, Vol. 3,
No. 3, pp. 210-223 

who describes a variety of error-correcting vector addition pro- 
cedures. In a later paper 

Samuel, Arthur L. (1967), “Some studies in machine learning using the
game of checkers, Part II,” IBM Journal of Research and Development,
Vol. 11, No. 4, pp. 601-618

he describes studies that lead toward detecting more complex
interactions between the partial predicates. The simple multilayer
perceptronlike machines discussed in Chapter 13 were described 
in 

Palmieri, G. and R. Sanna (1960), Methodos, Vol. 12, No. 48;

Gamba, A., L. Gamberini, G. Palmieri, and R. Sanna (1961), “Further
experiments with PAPA,” Nuovo Cimento Suppl. No. 2, Vol. 20, pp. 221-231.

Some earlier reward-modified machines, further from the final 
form of the perceptron, are described in 

Ashby, W. Ross (1952), Design for a Brain, Wiley, New York;

Clark, Wesley A., and Farley, B. G. (1955), “Generalization of pattern-recognition
in a self-organizing system,” Proceedings 1955 Western Joint
Computer Conference, pp. 85-111;

Minsky, M. (1954), “Neural nets and the brain-model problem,” doctoral
dissertation, Princeton University, Princeton, N.J.;

Uttley, A. M. (1956), “Conditional probability machines,” in Automata
Studies, Princeton University, Princeton, N.J., pp. 253-285.

The proof of the convergence theorem (Theorem 11.1) is another 
example of this sort of evolution. In an abstract mathematical 
sense, both theorem and proof already existed before the percep- 
tron, for several people had considered the idea of solving a set 
of linear inequalities by “relaxation” methods: successive adjustments
much like those used in the perceptron procedure. An
elegant early paper on this is

Agmon, S. (1954), “The relaxation method for linear inequalities,”
Canadian Journal of Mathematics, Vol. 6, No. 3, pp. 382-392.

In Agmon’s procedure, one computes the Φ-vector that gives the
largest numerical error in the satisfaction of the linear inequality,
and uses a multiple of that vector for correction. (See §11.4.)
We do not feel sufficiently scholarly to offer an opinion on
whether this paper should deserve priority for the discovery of
the convergence theorem. It is possible that the theorem would
have been instantly obvious had the cyberneticists interested in
perceptrons known about Agmon’s work.
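
A relaxation step of this kind can be sketched in a few lines. What follows is only an illustration under stated assumptions, not a transcription of Agmon's paper or of §11.4: we seek a vector A with A·Φ > 0 for every Φ in a finite list, we measure "error" as the raw shortfall of A·Φ below an arbitrary target margin, and the names relaxation, margin, and lam (the relaxation factor) are our own.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def relaxation(phis, margin=1.0, lam=1.0, max_steps=10_000):
    """Look for A with dot(A, phi) > 0 for every phi in phis.

    Each step finds the phi whose inequality falls furthest short of
    `margin` and adds a multiple of that phi to A. Returns None if no
    solution is found within `max_steps`.
    """
    a = [0.0] * len(phis[0])
    for _ in range(max_steps):
        if all(dot(a, phi) > 0 for phi in phis):
            return a
        # The vector with the largest numerical error (shortfall).
        worst = max(phis, key=lambda phi: margin - dot(a, phi))
        shortfall = margin - dot(a, worst)
        step = lam * shortfall / dot(worst, worst)
        a = [ai + step * wi for ai, wi in zip(a, worst)]
    return None


# Hypothetical data: the OR predicate on two inputs plus a constant input.
# The one negative example (0, 0, 1) is negated so that every inequality
# has the same form, A . phi > 0.
PHIS = [(0, 1, 1), (1, 0, 1), (1, 1, 1), (0, 0, -1)]
solution = relaxation(PHIS)
print(solution)
```

The difference from the simplest perceptron procedure is in the choice of correction: here the most badly violated inequality is corrected by an amount proportional to its error, rather than correcting whichever violated inequality is presented next by a fixed amount.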

In any case, the first proofs of the convergence theorem offered 
in cybernetic circles were quite independent of the work on linear 
inequalities. See, for example

Block, H. D. (1962), “The perceptron: a model for brain functioning,”
Reviews of Modern Physics, Vol. 34, No. 1, pp. 123-135.

This proof was quite complicated. The first use known to us of the
simpler kind of analysis used in §11.1 is in

Papert, Seymour (1961), “Some Mathematical Models of Learning,”
Proceedings of the Fourth London Symposium on Information Theory,
C. Cherry, Editor, Academic Press, New York.

Curiously, this paper is not mentioned by any later commentators 
(including the usually scholarly Nilsson) other than Rosenblatt in 
Neurodynamics. The convergence theorem is well discussed, in a 
variety of settings, by 

Nilsson, Nils (1965), Learning Machines, McGraw-Hill, New York,

who includes a number of historical notes. Readers who consult 
the London Symposium volume might also read 

Minsky, Marvin, and Oliver G. Selfridge (1961), “Learning in neural
nets,” Proceedings of the Fourth London Symposium on Information
Theory, C. Cherry, Editor, Academic Press, New York.

for some discussion of the relations between convergence and hill- 
climbing. Although Minsky and Papert did not yet know one 
another, their papers in that volume overlap to the extent of prov- 
ing the same theorem about the Bayesian optimality of linear 
separation. This coincidence had no obvious connection with 
their later collaboration. 

As Agmon had clearly anticipated the learning aspect of the 
perceptron, so had Selfridge anticipated its quality of combining 
local properties to yield apparently global ones. This is seen, 
for example, in 

Selfridge, Oliver G. (1956), “Pattern recognition and learning,” Proceedings
of the Third London Symposium on Information Theory, Academic
Press, New York, p. 345.

Incidentally, we consider that there has been a strong influence of 
these cybernetic ideas on the trend of ideas and discoveries in the 
physiology of vision represented, for example, in 

Lettvin, Jerome Y., H. Maturana, W. S. McCulloch, and W. Pitts (1959),
“What the frog’s eye tells the frog’s brain,” Proceedings of the IRE,
Vol. 47, pp. 1940-1951

and

Hubel, D. H., and T. N. Wiesel (1959), “Receptive fields of single
neurons in the cat’s striate cortex,” Journal of Physiology, Vol. 148,
pp. 574-591.

Other ideas used in this book come from earlier models of 
physiological phenomena, notably the paper of 

Pitts, W., and W. S. McCulloch (1947), “How we know universals,” 
Bulletin of Mathematical Biophysics, Vol. 9, pp. 127-147

which is the first we know of that treats recognition invariant 
under a group by integrating or summing predicates over the 
group. This paper and that of Lettvin et al. are reprinted in 

McCulloch, Warren S. (1965), Embodiments of Mind, The M.I.T. Press,
Cambridge, Mass., 

and this book reprints other early attempts to pass from the local 
to the global with networks of individually simple devices. In a 
third paper reprinted in Embodiments of Mind 

McCulloch, W. S., and Walter Pitts (1943), “A logical calculus of the 
ideas immanent in neural nets,” Bulletin of Mathematical Biophysics,
Vol. 5, pp. 115-137 

will be found the prototypes of the linear threshold functions 
themselves. Readers who are unfamiliar with this theory, or that 
of Turing machines, are directed to the elementary exposition in 

Minsky, Marvin (1967), Computation: Finite and Infinite Machines,
Prentice-Hall, Englewood Cliffs, N.J. 

The local-global transition has dominated several biological areas 
in recent years. A most striking example is the trend in analysis 
of animal behavior associated with the name of Tinbergen, as in 
his classic 

Tinbergen, N. (1951), The Study of Instinct, Oxford, New York. 

Returning to the technical aspects of perceptrons, we find that 
our main subject is not represented at all in the literature. We 
know of no papers that either prove that a nontrivial perceptron 
cannot accomplish a given task or else show by mathematical 
analysis that a perceptron can be made to compute any significant 
geometric predicate. There is a vast literature about experimental 
results but generally these are so inconclusive that we will refrain 
from citing particular papers. In most cases that seem to show
“success,” it can be seen that the data permits an order-1 separation,
or even a conjunctively local separation! In these cases, the
authors do not mention this, though it seems inconceivable that 
they could not have noticed it! 

The approach closest to ours, though still quite distant, is that of 

Bledsoe, W. W., and I. Browning (1959), “Pattern recognition and
reading by machine,” Proceedings of the Eastern Joint Computer Conference,
1959, pp. 225-232.

Another early paper that recognizes the curiously neglected fact 
that partial predicates work better when realistically matched 
to the problem, is 

Roberts, Lawrence G. (1960), “Pattern recognition with an adaptive 
network,” IRE International Convention Record, Part II, pp. 66-70.

Rosenblatt made some studies (in Neurodynamics) concerning the 
probability that if a perceptron recognizes a certain class of fig- 
ures it will also recognize other figures that are similar in certain 
ways. In another paper 

Rosenblatt, Frank (1960), “Perceptual generalization over transformation
groups,” Self-Organizing Systems, Pergamon Press, New York,
pp. 63-96.

he considers group-invariant patterns but does not come close 
enough to the group-invariance theorem to get decisive results. 

The nearest approach to our negative results and methods is the 
analysis of ψ_PARITY found in

Dertouzos, Michael (1965), Threshold Logic: A Synthesis Approach,
The M.I.T. Press, Cambridge, Mass.

This is also a good book in which to see how people who are 
not interested in geometric aspects of perceptrons deal with linear 
threshold functions. They had already been interested, for other 
reasons, in the size of coefficients of (first-order) threshold func- 
tions, and we made use of an idea described in 

Myhill, John and W. H. Kautz (1961), “On the size of weights required 
for linear-input switching functions,” IRE Transactions on Electronic 
Computers, Vol. 10, No. 2, pp. 288-290

to get our theorem in §10.1. A more recent result on order-1
coefficients is in

Muroga, Saburo, and I. Toda (1966), “Lower bounds on the number of
threshold functions,” IEEE Transactions on Electronic Computers, Vol.
EC-15, No. 5, pp. 805-806,

which improves upon an earlier result in 

Muroga, Saburo (1965), “Lower bounds on the number of threshold
functions and a maximum weight,” IEEE Transactions on Electronic
Computers, Vol. EC-14, No. 2, pp. 136-148.

These papers also discuss another question: the proportion of
Boolean functions (of n variables) that happens to be first-order.
To our knowledge, there is no literature about the same question 
for higher-order functions. 

The general area of artificial intelligence and heuristic program- 
ming was mentioned briefly in Chapter 13 as the direction we feel 
one should look for advanced ideas about pattern recognition and 
learning. No systematic treatment is available of what is known in 
this area, but we can recommend a few general references. The 
collection of papers in 

Feigenbaum, Edward A., and Julian Feldman (1963), Computers and
Thought, McGraw-Hill, New York.

shows the state of affairs in the area up to about 1962, while

Minsky, Marvin (1968), Semantic Information Processing, The M.I.T.
Press, Cambridge, Mass., 1968

contains more recent papers — mainly doctoral dissertations — 
dealing with computer programs that manipulate verbal and 
symbolic descriptions. Anyone interested in this area should also 
know the classic paper 

Newell, Allen, J. C. Shaw, and H. A. Simon (1959), “Report on a
general problem-solving program,” Proceedings of International Conference
on Information Processing, UNESCO House, pp. 256-264.

The program mentioned in Chapter 13 is described in detail in

Guzman, Adolfo (1968), “Decomposition of a visual scene into bodies,”
Proceedings Fall Joint Computer Conference, 1968.

Finally, we mention two early works that had a rather broad 
influence on cybernetic thinking. The fearfully simple homeostat 
concept mentioned in §11.6 is described in

Ashby, W. Ross (1952), Design for a Brain, Wiley, New York

which discussed only very simple machines, to be sure, but for the
first time with relentless clarity. At the other extreme, perhaps,
was

Hebb, Donald O. (1949), The Organization of Behavior, Wiley, New York

which sketched a hierarchy of concepts proposed to account for 
global states in terms of highly local neuronal events. Although 
the details of such an enterprise have never been thoroughly 
worked out, Hebb’s outline was for many workers a landmark in 
the shift from a search for a single, simple principle of brain 
organization toward more realistic attempts to construct hierarchies
(or rather heterarchies, as McCulloch would insist) that
could support the variety of computations evidently needed for 
thinking. 


You might like to compare your reactions with those of
other readers. These are serious discussions of the book:

Block, Herbert, “A Review of ‘Perceptrons,’” Information
and Control, Vol. 17, 1970, pp. 501-522.

Newell, Allen, “A step toward the understanding of information
processes,” Science, Vol. 165, Aug. 22, 1969, pp. 780-782.

Mycielski, Jan, Review of “Perceptrons,”
Bull. Amer. Math. Soc., Vol. 78, Jan. 1972, pp. 12-15.

Minsky, M., Re-view of Perceptrons,
A.I. Memo 293, Artificial Intelligence,
Cambridge, Mass. 02139.

The Block review also contains an extensive bibliography.




Index 


A 

A*, a solution or separation vector, 
164, 167 
A_file, 189
A_find, 189

Adaptive, 16, 19 
Adjacent squares, 74, 83, 87 
Agmon, S., 175. See also Bibliographic 
Notes 

Algebraic geometry, 66 
And/Or theorem, 36, 62, 86, 110, 228,
240 

Area, 54, 55, 62, 99, 130, 132
Armstrong, William, 245 
ARPA, 246 

Artificial intelligence, 232, 242. See also 
Heuristic programming 
Ashby, W. Ross, 81. See also Biblio- 
graphic Notes 

Associative memory, 2. See also Hash 
coding 

B 

Ball, Geoffrey, 211
BAYES, 193, 195-205
Beard, R., 245 
Best match, 222, 224 
BEST PLANE, 194 

Beyer, Terry, 146-148, 183, 244-245 
Bezout’s theorem, 66 
Bionics, 242 

Bledsoe, Woodrow W., 239, 245. See 
also Bibliographic Notes 
Block, H. D., 197. See also Bibliographic Notes
Blum, Manuel, 140, 245
Browning, Iben, 239. See also Bibliographic Notes

C 

C_j, a stratification class, 115
Center of gravity, 55 
Circle, 106 

Clark, W. A. See Bibliographic Notes 
Closed under a group, 47, 241 
Cluster, 191, 211, 233
Coefficients, 27, 70, 97, 126, 159, 243 
size of, 15, 17, 18, 117, 151-160 
Collapsing theorem, 77-80 
Compact, 64, 160, 186. Infinite sequences
of points from a compact
set always have limit points in the
set. Spheres and closed line segments
are compact. The concept is
discussed in all except the most modern
books on topology or real
analysis.

COMPLETE STORAGE, 189 
Component (connected), 87 
Computation time, 136, 143-150, 216 
Computer, 227, 231. See also Program
Conjunctively local, 8, 9, 11, 103, 105, 
129, 142 

Conjunctive normal form, 79 
Connectedness, 6, 8, 12, 13, 69-95, 
136-150, 232, 238, 241
Context, 98, 111-113 
Convergence theorem for perceptrons, 
15, 15, 164-187, 243 
Convexity, 6, 7, 26, 35, 103, 133, 141 
Cost of errors, 205 
Cover, Tom, 214 

Criticisms, 4, 15, 16, 165, 180, 189, 243 
Curvature, 104, 133 
Cycling theorem, 182 

D 

D (diameter), 129 
Data set, 188, 215 
Description, 233 

Dertouzos, M. See Bibliographic Notes 
Diameter-limited, 9, 12, 73, 103, 104, 
131-135 

Dilation group, 124 
Distance, 191-192, 222, 225

E 

Efron, Bradley, 183, 244 
Equivalence 
of figures, 46, 114, 124
of predicates, 47 

Estimators, 206. See also Probability 
Equivalence-class, 46 
Error correction, 163 
Euler number, 69, 86, 133, 134-135, 
241 

Exact match. See match 

F 

F, 25 
F+, 166 
F−, 166
FALSE, 26 

Fano, Robert M., 245 

Farley, B. G. See Bibliographic Notes 

Feedback, 3, 162 

Feigenbaum, E. A., 232. See also 
Bibliographic Notes 





Feldman, Julian, 232. See also Biblio- 
graphic Notes 
Fell, Harriet, 245 
Filter, 228 

Finite order. See Order 
Finite state, 15, 140
Forgetting, 207, 215 

G 

*-*.43 
gX, 42 
G(X), 86

Gamba, A., 228. See also Bibliographic
Notes 

Gamba perceptrons, 12, 228-231 
Gamberini, L. See Bibliographic Notes 
Geometric (property), 99, 243-244 
G-equivalence, 47
Gestalt, 20 

Global, 2, 17, 19, 242. See also Local 
Godel number, 70 

Group, of transformations, 22, 39, 41, 
44, 96, 126 

Group-invariance theorem, 22, 48, 96, 
100, 102, 239-241 

Guzman, Adolfo, 233, 255. See also 
Bibliographic Notes 

H 

hG, 44 

Haar measure, 55 
Hall, David, 211 
Hash coding, 190, 219-221 
Hebb, Donald O., 19. See also Biblio- 
graphic Notes 
Henneman, William, 245 
Heuristic programming, 232, 233, 239 
Hewitt, Carl, 140, 245 
Hill-climbing, 163, 178, 244
Hole (in component), 87
Homeostat, 180, 244
Hubel, D. H. See Bibliographic Notes
Huffman, David, 79, 113, 241
Hyperplane, 14, 195, 240

I 

/*, 137 

I(X), 31. The constant (= 1) identity
function.

Incremental computation, 215, 225
Independence (statistical), 200
Infinite groups, 44, 48, 97, 99, 114
Infinite sets, 11, 27, 37, 97, 114, 158
Integral, 54, 70, 133 


Invariant 
of group, 41 

topological, 86, 92-95. Definition: Intuitively,
any predicate that is unchanged
when a figure is deformed
without altering its connectedness
properties or the inside-outside relationships
among its components.
Irreducible algebraic curve, 66 
ISODATA, 194, 211
Iterative arrays, 146 

K 

k, K, used for the order of a perceptron
or the degree of a polynomial 
Kautz, W. H., 160. See also Bibliog- 
raphic Notes 

L 

L(Φ), 14, 28

Learning, 14, 15, 16, 18, 127, 149-150,
161-226, 243-244

Lettvin, J. Y. See Bibliographic Notes
Licklider, J. C. R., 246
Likelihood ratio, 209
Linear threshold function, 27, 31
Local, 2, 7, 10, 17, 73, 163, 235. See
also Global
Logarithmic sort, 217 
Loop, 3, 145, 231
Lyons, L., 245 

M 

McCulloch, Warren S., 55, 79, 239,
240. See also Bibliographic Notes
Marill, Thomas, 240 
Maturana, H. See Bibliographic Notes 
Mask, 22, 31, 35, 153, 155, 240
Match 

best, 222-226. See also Nearest neigh- 
bor 

exact, 215, 221. See also Hash coding
Maximum likelihood, 199, 202. See 
also BAYES 
Measure, 55, 228 

Memory, 136, 141, 145, 149, 215, 216,
243, 249. See also Learning
Metric, 71

Minsky, Marvin, 232. See also Biblio- 
graphic Notes 
Moment, 55, 99 
Multilayer, 228-232
Muroga, S. See Bibliographic Notes 





Myhill, John, 160. See also Biblio- 
graphic Notes 

N 

N, (JO, 60, 107 

nearest neighbor, 150, 194-199. See 
also Match, best 
Net, 206 

Neuron, 14, 19, 210, 234 
Newell, Allen, 234. See also Biblio- 
graphic Notes 

Nilsson, Nils, 183, 244. See also 
Bibliographic Notes 
Nonseparable, 181 
Normalization, 116, 126-127 

O 

“One-in-a-box” theorem, 59-61, 69,
75, 112 

Order, 12, 30, 35-56, 62, 78, 239-241
Order-restricted, 12 

P 

Palmieri, G. See Bibliographic Notes 
Papert, S. See Bibliographic Notes 
Parallel, 2, 3, 5, 17, 142-150, 241, 249.
See also Serial 

Parity, 56-59, 83, 149, 151-158, 176,
230, 240, 241 

Paterson, Michael, 92, 242, 245 
Pattern recognition, 116, 227, 242, 244
Perception, human, 73, 238
Perceptron, 12. See also L(Φ)
Permutation group, 40, 46, 56 
Perspective, 235. See also Three-dimen- 
sional predicates 
Phenomenon, mathematical, 228 
Physiology, 14 

Pitts, Walter, 55, 99, 240. See also 
Bibliographic Notes 
Polynomial of a predicate, 23, 41, 57, 
60, 63 

Positive normal form, 33, 34, 240 
Predicate, 6, 25, 26 
Predicate-scheme, 25, 37 
Preprocessing, 113

Probability, 14-15, 165, 193-209, 239 
Programs, 9, 14, 136-139, 164-167, 
226, 232-238 
Pushdown list, 71 

Q 

Query word, 188 


R 

R, Retina, 5, 25, 26

|R|, the number of points in the retina
Rectangle, 104, 130, 134
Reflection symmetry, 117
Reinforcement, 161, 206, 215
Repeated stratification, 121
Resolution. See Tolerance
Restrictions (on perceptrons), 5, 9, 12,
231

Roberts, L. G. See Bibliographic Notes 
Rosenblatt, Frank, 19, 239. See also 
Bibliographic Notes 
Rotation group, 43, 98, 120, 127 

S 

Samuel, Arthur, 16, 209. See also 
Bibliographic Notes 
Sanna, R. See Bibliographic Notes 
Scene-analysis, 102, 232-239 
Self-organizing, 16, 19, 234 
Selfridge, Oliver G., 179, 245. See also 
Bibliographic Notes 
Serial, 17, 96, 127, 136-140, 232,
241-243

Shaw, J. C., 234. See also Bibliographic 
Notes 

Similarity group, 98 
Simon, Herbert A., 234. See also 
Bibliographic Notes 
Solomonoff, Ray, 234
Solution vector, 165
Spectrum, 70, 77, 107, 110, 134
Square 

geometric, 105, 122, 243 
unit of retina, 44, 71 
Statistics. See Probability 
Stratification, 114-128, 156, 227, 292
Strauss, Dona, 76, 244, 245 
Support, 27 
Sussman, Gerald, 245 
Switch, 81, 241 
Symmetry, 117, 242

T 

Template, 131, 230 
Theorem, 226, 239 

Three-dimensional predicates, 85, 148,
232

Threshold, 10 

Threshold logic, 58. Name for theory 
of linear threshold functions. 





Tinbergen, N. See Bibliographic Notes 
Toda, I. See Bibliographic Notes 
Tolerance, 71, 72, 124, 134, 142. Allow- 
able measurement errors. 

Topology, 9, 69, 71, 85-95, 134-135,
242 

Toroidal, 98, 126. See also Torus
Torus, 17, 44, 80, 85, 98, 126
Transitive group, 53-55
Translation group, 41, 46, 96, 98, 99-101,
105-106, 114, 118, 120, 124, 159
Translation-invariant, 56, 239
Translation spectra, 105. See also
Spectrum
Triangle, 130
TRUE, 26

Turing machine, 139, 142 
Twins, 114, 127, 242 

U 

Uttley, A. M. See Bibliographic Notes 
Unbounded, 127. See also Infinite 

V 

Vector, 188, 240 
Vector geometry, 165 

W 

Weight, 10 

Weight of evidence, 204, 238 
Wiesel, T. N. See Bibliographic Notes
White, John, 243 

X 

x, 25. A point of the retina R. 
x ∈ X, 27

X, 5, 26. A picture, that is, a subset
of R.

α_φ, 10, 49. Notation for coefficient of
φ(X).

α(φ), alternative notation for α_φ.

5, 26 
32 

ψ_CIRCLE

ψ_CONNECTED, 6, 8, 12, 13

ψ_CONVEX

ψ_PARITY
ψ_SYMMETRICAL

φ, 26. A partial predicate.

φ_i, 5, 10

φ_A(X), 7. The predicate [A ⊂ X].

Φ, 5, 8, 10. A family of predicates.


Φ_i, 53. An equivalence class of predicates.

Φ, a vector from a family F
Φ̂, unit vector, 166
Try, 115 

≡, 27. The Boolean equivalence predicate.

[ ], 26. Brackets for predicates.



Perceptrons 

Expanded Edition 

by Marvin L. Minsky and Seymour A. Papert 

Perceptrons — the first systematic study of parallelism in computation — 
has remained a classical work on threshold automata networks for nearly 
two decades. It marked a historical turn in artificial intelligence and is 
required reading for anyone who wants to understand the connectionist 
revival that is going on today. 

Artificial-intelligence research, which for a time concentrated on the
programming of von Neumann computers, is swinging back to the idea that
intelligence might emerge from the activity of networks of neuron-like
entities. Minsky and Papert’s book was the first example of a mathematical
analysis carried far enough to show the exact limitations of a class of
computing machines that could seriously be considered as models of the
brain.

Now the new developments in mathematical tools, the recent interest of
physicists in the theory of disordered matter, the new insights into and
psychological models of how the brain works, and the evolution of fast
computers that can simulate networks of automata have given Perceptrons
new importance.

Minsky and Papert have added a chapter to their seminal study in which 
they discuss the current state of parallel computers, review developments 
since the appearance of the 1972 edition, and identify new research
directions related to connectionism. The central theoretical challenge facing
connectionism, they observe, is in reaching a deeper understanding of 
how "objects" or "agents" with individuality can emerge in a network. 
Progress in this area would link connectionism with what the authors 
have called "society theories of mind."

Marvin L. Minsky is Donner Professor of Science in MIT's Electrical 
Engineering and Computer Science Department, and Seymour A. Papert 
is Professor of Media Technology at MIT. 



The MIT Press 

Massachusetts Institute of Technology 
Cambridge, Massachusetts 02142


MINPPR 

0-262-63111-3